Cloud Layer

Zooming into the Cloud Layer


The Cloud Layer is the technical backbone of the RDC. It is based on a multi-cloud infrastructure that includes, for example, the de.NBI cloud and GWDG. These cloud providers offer scalable functionality for distributed computing as well as cloud storage with near-unlimited resources, so that users can run compute-intensive jobs or analyze very large data sets in a user-friendly way. The layer is organized into three components, Primary Storage, Secondary Storage and Cloud Computing, which are introduced in the following paragraphs.

Primary Storage - Infrastructure-as-a-Service

The primary storage consists of a cloud-based object storage, the Aruna Object Storage (AOS), developed at JLU. A detailed description of the technology will be given in a separate document. In this paragraph, we summarize the advantages of such an object storage in the RDC. As the name indicates, an object storage organizes data in units of objects (not as files, as known from a file system). Each object comes with a unique identifier and metadata that describe its contents. Object storage is particularly suitable for unstructured data items like text, images, audio and video, or for semi-structured items expressed in JSON.

There are two options for retrieving objects. The first uses the unique identifier to retrieve the associated object. The second uses search labels and returns all objects that match one or more labels. A key feature of an object storage is that objects can be retrieved through RESTful APIs over HTTP. Thus, a request can be issued either from an arbitrary programming language or simply from a Web browser.

While an object store could in principle run on an ordinary desktop computer, the object storage of the RDC is distributed among nodes in the cloud. This gives at least three advantages. First, the cloud offers near-unlimited storage capacity. Second, objects can be distributed among multiple nodes, so that access to objects and their processing can be parallelized, resulting in large performance improvements. Third, the data objects are replicated to guard against system failures. When one of the nodes in the cloud is out of service, other nodes still offer a copy of the objects.
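The two retrieval options described above can be illustrated with a small in-memory sketch. This is not the AOS API: the class `ToyObjectStore`, its methods, and the label names are hypothetical and only show how lookup by identifier differs from lookup by search labels.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: opaque payload plus descriptive metadata and search labels."""
    data: bytes
    metadata: dict
    labels: dict
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class ToyObjectStore:
    """Hypothetical in-memory stand-in for an object store (not the AOS API)."""

    def __init__(self):
        self._objects = {}

    def put(self, data, metadata=None, labels=None):
        """Store a payload and return the unique identifier assigned to it."""
        obj = StoredObject(data=data, metadata=metadata or {}, labels=labels or {})
        self._objects[obj.object_id] = obj
        return obj.object_id

    def get(self, object_id):
        """Option 1: retrieve exactly one object by its unique identifier."""
        return self._objects[object_id]

    def search(self, **labels):
        """Option 2: return all objects that match every given label."""
        return [
            obj for obj in self._objects.values()
            if all(obj.labels.get(key) == value for key, value in labels.items())
        ]


# Illustrative, made-up objects.
store = ToyObjectStore()
image_id = store.put(b"...", {"content_type": "image/png"},
                     {"taxon": "Apis mellifera", "kind": "image"})
store.put(b"{}", {"content_type": "application/json"},
          {"taxon": "Apis mellifera", "kind": "record"})
store.put(b"...", {"content_type": "audio/wav"},
          {"taxon": "Bombus terrestris", "kind": "audio"})

by_id = store.get(image_id)                      # one object
by_label = store.search(taxon="Apis mellifera")  # all objects with this label
```

In a real deployment both calls would be HTTP requests against the storage service rather than dictionary lookups; the sketch only mirrors the access pattern.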

The object storage of a research infrastructure like the RDC offers a further advantage: objects can be shared among many users. Instead of each user managing an individual copy of an object, a single object is shared among all users. Moreover, domain scientists do not have to manage the underlying systems themselves; they simply use the unified access interface of the RDC object storage.

In general, object storage assumes that objects are static and rarely updated. However, some data sets are highly dynamic, such as time series in biodiversity. To address this, the RDC introduces a versioning concept that manages static versions of a data set, one of which is the current version. Instead of applying updates to the current version immediately, updates can be collected and applied together when the next version of the data set is created. Versions of data sets are kept available for processing and are never deleted from the object store. This versioning approach therefore also supports the reproducibility of research results.
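The versioning idea can be sketched as follows. This is a simplified illustration and not the RDC implementation: the class `VersionedDataset` and its method names are hypothetical, and updates are assumed to be appended observations, as in a growing time series.

```python
class VersionedDataset:
    """Hypothetical sketch: immutable versions plus a buffer of pending updates."""

    def __init__(self, initial_records):
        self._versions = [tuple(initial_records)]  # versions are never deleted
        self._pending = []

    def stage_update(self, record):
        """Collect an update without touching the current version."""
        self._pending.append(record)

    def commit_version(self):
        """Apply all collected updates at once; return the current version number."""
        if self._pending:
            self._versions.append(self._versions[-1] + tuple(self._pending))
            self._pending = []
        return len(self._versions)

    def read(self, version=None):
        """Read a specific version (1-based); the default is the current one."""
        index = len(self._versions) if version is None else version
        return self._versions[index - 1]


# Illustrative, made-up time series of (date, count) observations.
series = VersionedDataset([("2023-05-01", 12)])
series.stage_update(("2023-05-02", 15))
series.stage_update(("2023-05-03", 9))

before_commit = series.read()          # still the old version, updates only staged
new_version = series.commit_version()  # both updates applied in one step
```

Because earlier versions stay readable, an analysis that recorded "version 1" can be rerun later against exactly the same data.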

Secondary Storage - Platform-as-a-Service

In contrast to the primary storage, the secondary storage offers data systems as a service that are administered in the cloud. For example, users are often interested in a relational database system like PostgreSQL with its native SQL interface, and some of the tools use a Postgres database to manage spatial data. Instead of requiring users to set up such a system locally, we offer services with which users can create databases. In particular, we see the need for such a service for managing data products that are created in the Semantic Layer. These data products often come with a structured or semi-structured data model (e.g., JSON), so it is beneficial to use an appropriate data service rather than hosting a database system on a local machine. In addition, cloud-ready systems often exist, and the scalability of the cloud then gives a performance advantage. So far, we have not implemented such services, but we plan to support various data services such as PostgreSQL, NoSQL database systems and triple stores, as these systems are already required within the RDC.
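As a rough illustration of managing a semi-structured data product through an SQL interface, the sketch below uses Python's built-in `sqlite3` module as a local stand-in. A managed PostgreSQL service would be reached over a network connection instead; the table name, column names, and sample records are assumptions made for this example.

```python
import json
import sqlite3

# sqlite3 is only a local stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_product ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "domain TEXT NOT NULL, payload TEXT NOT NULL)"
)

# Illustrative, made-up data products with a JSON payload.
products = [
    ("bee-occurrences", "biodiversity", {"records": 1200, "format": "JSON"}),
    ("soil-samples", "ecology", {"records": 300, "format": "JSON"}),
]
conn.executemany(
    "INSERT INTO data_product (name, domain, payload) VALUES (?, ?, ?)",
    [(name, domain, json.dumps(payload)) for name, domain, payload in products],
)

# Structured columns are filtered in SQL; the JSON payload is decoded afterwards.
rows = conn.execute(
    "SELECT name, payload FROM data_product WHERE domain = ? ORDER BY name",
    ("biodiversity",),
).fetchall()
result = [(name, json.loads(payload)) for name, payload in rows]
conn.close()
```

With a hosted service, only the connection step would change; the SQL statements and the JSON handling stay the same for the user.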

Cloud Computing

Short description

Our Goals
- Unified physical data model in primary cloud storage
- Universal interfaces for accessing data
- Scalable technical cloud services for the community

Approaches
- HTTP REST interface (e.g., S3)
- JSON-based data model
- NoSQL database system
- Dedicated DBMSs to support semantic storage
- Scalable processing tools, e.g., Apache Spark, Hadoop
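The idea behind scalable processing tools, namely computing partial results per data partition in parallel and then combining them, can be sketched with the Python standard library. This is not Spark or Hadoop code; the thread pool stands in for cluster workers, and the partitions and values are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up measurements, split into partitions as if stored on different nodes.
partitions = [
    [3, 5, 8],
    [13, 21],
    [34, 55, 89],
]


def partial_sum(partition):
    """Map step: compute a partial result for one partition."""
    return sum(partition), len(partition)


# Each partition is processed by its own worker (a thread stands in for a node).
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(partial_sum, partitions))

# Reduce step: combine the partial results into the global mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
```

Only the small partial results travel to the combining step, which is why this pattern scales to data sets that do not fit on a single machine.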

Associated Services:
