[Diagram: RDC-cloud-layer-detail]
The Cloud Layer provides storage and compute resources. For the storage resources, we distinguish between primary object storage systems, with the Aruna Object Storage as reference implementation, and secondary semantic storage systems, where dedicated database systems are offered to manage the data products of the Semantic Layer. The compute resources consist of tools, such as Docker and Kubernetes, that facilitate the deployment of services of the upper layers in the cloud.
Primary Object Storage - Infrastructure-as-a-Service
The primary storage consists of a cloud-based object storage. In our case, the Aruna Object Storage (AOS), developed at Justus Liebig University Giessen (JLU), is the only implementation of an object storage in RDC. An object storage offers several advantages.
First, as indicated by the name, an object storage organizes data in units of objects (not, for example, as files as known from a file system). Objects have a system-generated unique identifier and metadata, expressed as key-value pairs, that describe the contents of the objects. Object storage is particularly applicable to unstructured data like text, images, audio, and video, and to semi-structured data expressed in JSON format. The object creator is responsible for defining access credentials and attaching suitable metadata in order to support finding objects. To retrieve objects from the object storage, there are two options. One is to use the unique identifier to retrieve the associated object; the other employs search labels, returning all objects with metadata matching one or multiple search labels.
Second, a key functionality of an object storage is the possibility to use RESTful APIs to retrieve the objects from the storage via HTTP requests. Thus, it is possible to use either an arbitrary programming language or simply a web browser to access the data.
While an object store might be located on an ordinary desktop computer, it is more commonly used in distributed cloud infrastructures. This gives at least three advantages: First, the cloud offers nearly unlimited storage capacity. Second, it is possible to distribute objects among multiple nodes such that access to objects and processing can be parallelized, resulting in large performance improvements. Third, the data objects are replicated to avoid the problem of system failures.
When one of the nodes in the cloud is out of service, there will be other nodes offering a copy of the objects.
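The two retrieval options described above can be illustrated with a minimal in-memory sketch. It is only a toy model: the class and method names (`ToyObjectStore`, `put`, `get`, `search`) and the metadata keys are hypothetical and do not reflect the actual AOS interface, which is accessed via HTTP requests.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """An opaque payload plus descriptive key-value metadata."""
    data: bytes
    metadata: dict[str, str] = field(default_factory=dict)


class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not the AOS API)."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, data: bytes, metadata: dict[str, str]) -> str:
        # The store, not the caller, generates the unique identifier.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id: str) -> StoredObject:
        # Option 1: direct retrieval by unique identifier.
        return self._objects[object_id]

    def search(self, **labels: str) -> list[str]:
        # Option 2: return the identifiers of all objects whose metadata
        # matches every given search label.
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in labels.items())
        ]


store = ToyObjectStore()
photo_id = store.put(b"<jpeg bytes>", {"taxon": "Odonata", "kind": "image"})
notes_id = store.put(b'{"count": 3}', {"taxon": "Odonata", "kind": "json"})
```

Here `store.get(photo_id)` returns exactly one object, while `store.search(taxon="Odonata")` returns the identifiers of both objects and `store.search(taxon="Odonata", kind="json")` narrows the result to the JSON item.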
The object storage of a research infrastructure like RDC gives another advantage by sharing objects among many users. Instead of managing an individual copy of an object for each user, it is now possible to share the object among all users with suitable access credentials. Consider, for example, a picture collection of dragonflies that might be of interest to many communities, while the locations of threatened species are visible only to a few experts. Moreover, domain scientists do not have to care about the management of systems, but simply use the unified access interface of the object storage of RDC or the higher-level services built on top of the Cloud Layer.
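The dragonfly example can be sketched as a single shared object whose metadata is filtered according to the credentials of the requesting user. This is an illustration only: the `SharedObject` class, the `visible_metadata` function, the credential name `expert`, and all metadata values are assumptions made for this sketch, not the access model implemented in RDC.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SharedObject:
    """One stored object shared by all users (hypothetical model)."""
    object_id: str
    public_metadata: dict[str, str]
    # Metadata that only users holding the named credential may see.
    restricted_metadata: dict[str, str] = field(default_factory=dict)
    required_credential: str = "expert"


def visible_metadata(obj: SharedObject, credentials: set[str]) -> dict[str, str]:
    """Return the metadata a user with the given credentials may read."""
    view = dict(obj.public_metadata)
    if obj.required_credential in credentials:
        view.update(obj.restricted_metadata)
    return view


photo = SharedObject(
    object_id="dragonfly-0001",
    public_metadata={"taxon": "Odonata", "kind": "image"},
    restricted_metadata={"location": "withheld site A"},
)

public_view = visible_metadata(photo, credentials=set())
expert_view = visible_metadata(photo, credentials={"expert"})
```

Both users read the same stored object; only the expert view contains the `location` entry, so no per-user copy of the picture is needed.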
In general, object storage requires that objects are static and that updates of an object are seldom. However, in biodiversity, there are also highly dynamic data sets, such as time series, that need to be updated continuously. To address this issue, RDC introduces a version concept that manages static versions, also known as snapshots, of a data set, where one of them is the current version. Instead of applying updates to the current version immediately, it is possible to collect updates and apply them at once when the next version of the data set is created. Versions of data sets are then managed for processing and never deleted from the object store. Thus, such a version approach is also important for the reproducibility of research results.
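The version concept can be sketched as follows, assuming a data set that is a simple append-only time series. The class `VersionedDataSet` and its methods are hypothetical names chosen for this illustration and are not the RDC implementation.

```python
from __future__ import annotations


class VersionedDataSet:
    """Toy model of snapshot versioning (hypothetical, not the RDC implementation)."""

    def __init__(self, initial_records: list[float]) -> None:
        # Snapshots are stored as immutable tuples and are never deleted.
        self._snapshots: list[tuple[float, ...]] = [tuple(initial_records)]
        self._pending: list[float] = []

    @property
    def current_version(self) -> int:
        return len(self._snapshots) - 1

    def append(self, record: float) -> None:
        # Updates are only collected here; no snapshot changes yet.
        self._pending.append(record)

    def commit(self) -> int:
        # Apply all collected updates at once by creating the next snapshot.
        self._snapshots.append(self._snapshots[-1] + tuple(self._pending))
        self._pending.clear()
        return self.current_version

    def read(self, version: int | None = None) -> tuple[float, ...]:
        # Readers see the current snapshot unless they pin an older version.
        if version is None:
            version = self.current_version
        return self._snapshots[version]


series = VersionedDataSet([12.1, 12.4])
series.append(12.9)
series.append(13.2)
before_commit = series.read()   # pending updates are not visible yet
new_version = series.commit()   # creates snapshot 1 from snapshot 0 plus updates
```

An analysis that recorded version 0 can later call `series.read(0)` and obtain exactly the data it originally processed, which is what makes pinned versions useful for reproducibility.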
Cloud Computing - Platform as a Service
The RDC is not limited to storage only, but strives to offer scalable cloud-based computing services that are easy to use based on intuitive low-code interfaces.
We aim to offer these low-code services in the Mediation Layer and are currently implementing them using the popular and powerful open-source Apache Spark infrastructure as a PaaS (Platform as a Service) for tasks related to data transformation. Building on such infrastructure lets us benefit from existing developments on top of it, so we can avoid building things from scratch and focus our efforts on the specific challenges present in the RDC.
To enable tools like Apache Spark to run in the cloud in a distributed and scalable fashion, we build on Kubernetes for managing the compute resources available on cloud providers like de.NBI. In order to hide the complexity of deploying Apache Spark in a Kubernetes cluster from end users, we want to offer Apache Kyuubi as a distributed JDBC/ODBC gateway service. This enables the submission of Spark tasks in SQL-like query syntax, while Kyuubi handles the creation and management of Spark compute instances in the Kubernetes cluster. At the same time, it enables easy integration into other tools, since only a JDBC/ODBC client is required for interfacing with the service. For example, on the Semantic Layer, we offer a Jupyter Notebooks service, from which data transformations can then seamlessly be triggered to run on Spark, offering the power of the cloud to our users.
Other services of the RDC also have to be deployed and maintained in the cloud. Examples of such services are specific database systems like PostgreSQL and Elasticsearch, which are used for managing the data products offered in the Semantic Layer, or specific software tools for creating workflows and checking data quality, which are provided in the Mediation Layer. While these services reside in the upper layers of the RDC, it is important to offer basic infrastructure tools within the Cloud Layer. Probably the most essential tool is Docker, which supports the containerization of services.
A Docker container is a standalone, executable software package that includes everything needed to run a service (within a cloud infrastructure like de.NBI). It encapsulates the dependencies of a service (avoiding version mismatches) and isolates it from the underlying hardware. In addition, Kubernetes and OpenStack support the orchestration and management of Docker containers within a cloud infrastructure. For example, they are responsible for resource utilization, load balancing, and scalability.
Example Components