[draw.io diagram: RDC-cloud-layer-detail]
The Cloud Layer provides support for storage and compute resources. For the storage resources, we distinguish between primary storage, with its reference implementation Aruna Object Storage, and secondary storage, where dedicated data systems are offered to manage the data products of the Semantic Layer. The compute resources consist of tools, such as Docker and Kubernetes, that facilitate the deployment of services of the upper layers in the cloud.
Primary Storage - Infrastructure-as-a-Service
The primary storage consists of a cloud-based object storage, in our case the Aruna Object Storage (AOS), developed at Justus Liebig University Giessen (JLU). An object storage offers several advantages. As the name indicates, an object storage organizes data in units of objects (not, for example, as files known from a file system). Each object has a system-generated unique identifier and metadata describing its contents. The object creator is responsible for defining access credentials and for attaching suitable metadata that describe the contents of unstructured data, such as text, images, audio, and video, or of semi-structured data expressed in JSON format, and thereby support finding objects. There are two options for retrieving objects from the object storage: the first uses the unique identifier to retrieve the associated object; the second uses search labels, attached to the objects as metadata, and returns all objects whose metadata match one or multiple search labels. A second key functionality of an object storage is that objects can be retrieved via RESTful APIs over HTTP. Thus, it is possible to access the data from an arbitrary programming language or simply with a web browser.

While an object store might be located on an ordinary desktop computer, an object storage is typically distributed among nodes in the cloud. This brings at least three advantages: First, the cloud offers near unlimited storage capacity. Second, objects can be distributed among multiple nodes such that access to objects and processing can be parallelized, resulting in large performance improvements. Third, the data objects are replicated to guard against system failures. When one of the nodes in the cloud is out of service, other nodes offer a copy of the objects.
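The two retrieval options can be illustrated with a minimal in-memory sketch. This is not the actual AOS API; the class and method names are hypothetical and only demonstrate the access pattern:

```python
import uuid


class ObjectStore:
    """Toy in-memory object storage illustrating the two retrieval modes."""

    def __init__(self):
        self._objects = {}  # unique identifier -> (data, metadata)

    def put(self, data, metadata):
        # The store generates a unique identifier for every new object.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = (data, dict(metadata))
        return object_id

    def get(self, object_id):
        # Option 1: retrieve a single object by its unique identifier.
        return self._objects[object_id]

    def find(self, **labels):
        # Option 2: return all objects whose metadata match every
        # given search label.
        return [
            (oid, data)
            for oid, (data, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in labels.items())
        ]


store = ObjectStore()
oid = store.put(b"...image bytes...", {"taxon": "Odonata", "type": "image"})
data, meta = store.get(oid)         # retrieval by unique identifier
hits = store.find(taxon="Odonata")  # retrieval by search label
```

In the real system the same two operations would be issued as HTTP requests against the storage's RESTful API rather than local method calls.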
The object storage of a research infrastructure like the RDC offers another advantage: objects can be shared among many users. Instead of managing an individual copy of an object for each user, it is possible to share the object among all users with suitable access credentials. Consider, for example, a picture collection of dragonflies that is of interest to different communities. Moreover, domain scientists do not have to care about the management of systems, but simply use the unified access interface of the object storage of the RDC or the higher-level services of the upper RDC layers.
In general, object storage assumes that objects are static and that updates to an object are rare. However, in biodiversity there are highly dynamic data sets, such as time series, that need to be updated continuously. To address this issue, the RDC introduces a version concept that manages static versions, also known as snapshots, of a data set, one of which is the current version. Instead of applying updates to the current version immediately, updates can be collected and applied all at once when the next version of the data set is created. Versions of data sets are then managed for processing and never deleted from the object store. Thus, this version approach also supports the reproducibility of research results.
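The collect-then-snapshot idea can be sketched as follows. This is a simplified illustration, not the RDC implementation; all names are hypothetical:

```python
class VersionedDataset:
    """Toy versioned data set: updates are buffered and only applied
    when the next immutable snapshot is created."""

    def __init__(self, initial):
        self._versions = [dict(initial)]  # snapshots are never deleted
        self._pending = []                # collected updates

    @property
    def current(self):
        # The most recent snapshot is the current version.
        return self._versions[-1]

    def update(self, key, value):
        # Updates are buffered instead of changing the current version.
        self._pending.append((key, value))

    def create_version(self):
        # Apply all pending updates at once to produce the next snapshot.
        snapshot = dict(self.current)
        for key, value in self._pending:
            snapshot[key] = value
        self._pending.clear()
        self._versions.append(snapshot)
        return len(self._versions) - 1  # new version number

    def get_version(self, n):
        # Older versions remain available, supporting reproducibility.
        return self._versions[n]


ds = VersionedDataset({"2023-01": 42})
ds.update("2023-02", 57)   # buffered, current version unchanged
v1 = ds.create_version()   # pending updates applied in one step
```

Because every snapshot stays retrievable, an analysis can later be re-run against exactly the version it originally used.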
Cloud Computing - Platform-as-a-Service
The RDC is not limited to storage, but strives to offer scalable cloud-based compute services that are easy to use through low-code interfaces, such as an intuitive workflow tool to create a data product. We aim to offer these low-code services in the Mediation Layer and are currently implementing them using the popular and powerful open-source Apache Spark infrastructure as a PaaS (Platform-as-a-Service) for data-transformation tasks. Building on such an infrastructure lets us benefit from existing developments on top of it, so we can avoid building things from scratch and focus our efforts on the specific challenges present in the RDC.
To enable tools like Apache Spark to run in the cloud in a distributed and scalable fashion, we rely on Kubernetes to manage compute resources and on OpenStack to provision these resources on cloud providers like de.NBI, thereby avoiding vendor lock-in to specific cloud providers. In order to remove the complexity of deploying Apache Spark in a Kubernetes cluster for end-users, we want to offer
Apache Kyuubi as a distributed JDBC/ODBC gateway service, providing a database API usable from Python. This enables the submission of Spark tasks in SQL-like query syntax, while Kyuubi handles the creation and management of Spark compute instances in the Kubernetes cluster. At the same time, it enables easy integration into other tools, since only a JDBC/ODBC client is required for interfacing with the service. For example, on the Semantic Layer we offer a Jupyter Notebooks service, from which data transformations can seamlessly be triggered to run on Spark, offering the power of the cloud to our users.
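From the user's perspective, this interaction pattern is ordinary SQL over a database connection. The sketch below illustrates the pattern using Python's built-in sqlite3 module as a stand-in; against the actual service, one would instead open a JDBC/ODBC (e.g. Hive-compatible) connection to the Kyuubi gateway, and the query would be executed by Spark in the Kubernetes cluster:

```python
import sqlite3

# Stand-in for a JDBC/ODBC connection to the Kyuubi gateway; the SQL
# workflow (connect, execute, fetch) stays the same when Spark runs it.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A data transformation expressed in SQL-like query syntax.
cur.execute("CREATE TABLE observations (taxon TEXT, count INTEGER)")
cur.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("Odonata", 12), ("Odonata", 30), ("Coleoptera", 7)],
)
cur.execute(
    "SELECT taxon, SUM(count) FROM observations "
    "GROUP BY taxon ORDER BY taxon"
)
result = cur.fetchall()
conn.close()
```

Because only a standard database client is needed, the same query could be issued from a Jupyter notebook, a BI tool, or any other JDBC/ODBC-capable application without knowing anything about the underlying Spark deployment.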