Achieving interoperability is essential to maximize the impact of the RDC, because it fosters collaboration and supports the reproducibility of research findings. The Mediation Layer provides community-agnostic functionality for creating high-quality data products. It includes tools and standards for ensuring that datasets from different sources and domains can be used together. This involves standard metadata and data formats, tools for managing metadata, tools for transforming data from a technical data model into a semantic model, and tools for improving data quality.
Zooming into the Mediation Layer
Semantic Web
Terminologies are structured knowledge bases that provide a common language and a framework for mapping and harmonizing data from diverse sources, enabling their transformation into high-quality data products. They help ensure that different datasets can be understood and efficiently integrated. Using domain-specific terminologies enables semantic interoperability, which means that different systems can communicate and exchange information with a shared understanding of the datasets' meaning. This is particularly important in heterogeneous environments and when integrating data from various sources. A key component of the RDC Mediation Layer is our terminology repository and service, BiodivPortal, which supports the management, sharing and use of biodiversity-related terminologies. BiodivPortal provides centralized storage for terminologies and offers various functionalities for their management through both its user interface and its API. Other components of the RDC can perform semantic enrichment by reusing the offered semantic services or by directly accessing and integrating terminologies.
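As a minimal sketch of what semantic enrichment with a terminology can look like, the following Python snippet annotates dataset column names with concept identifiers. It does not call BiodivPortal: the in-memory terminology, the example.org IRIs and the function names are all assumptions made for illustration.

```python
# Hypothetical in-memory terminology: concept IRI -> preferred label and synonyms.
# The IRIs and labels are illustrative and are not taken from BiodivPortal.
TERMINOLOGY = {
    "http://example.org/term/BodyMass": {
        "label": "body mass",
        "synonyms": ["weight", "body weight"],
    },
    "http://example.org/term/WingLength": {
        "label": "wing length",
        "synonyms": ["length of wing"],
    },
}


def normalize(name):
    """Lower-case a name and treat underscores as spaces."""
    return name.strip().lower().replace("_", " ")


def build_index(terminology):
    """Map every normalized label and synonym to its concept IRI."""
    index = {}
    for iri, entry in terminology.items():
        for name in [entry["label"], *entry["synonyms"]]:
            index[normalize(name)] = iri
    return index


def annotate(columns, index):
    """Return {column name: concept IRI, or None if no concept matches}."""
    return {column: index.get(normalize(column)) for column in columns}


INDEX = build_index(TERMINOLOGY)
annotations = annotate(["Body_Weight", "wing length", "colour"], INDEX)

for column, iri in annotations.items():
    print(column, "->", iri)
```

Columns from different sources that use different names for the same concept end up with the same identifier, while unmatched columns stay visible as None so that they can be curated by hand.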
(Meta)data Standards
Establishing standardized data formats and metadata is a fundamental step to ensure that data shared within the RDC are structured in a consistent and understandable manner. Standards commonly used in research data commons include data exchange formats such as JSON as well as metadata and data standards. Our goal is to build services based on well-established, internationally agreed semantic standards spanning both data and metadata. More and more metadata standards are being offered as ontologies in a Semantic Web compliant format. We identified an initial subset of semantic core models, namely Schema.org and ABCD3, and stored and published their corresponding ontologies on BiodivPortal. Additional standards for data, such as the Ecological Trait-data Standard (ETS), are being collected and stored in the same way.
(Meta)data Transformation
The mediation layer provides a set of transformation services and tools for data and metadata. Harmonizing metadata is critical for ensuring that different datasets can be effectively integrated. This involves mapping metadata elements from diverse sources to an agreed-upon common schema. Tools for data transformation and conversion are needed to convert (meta)data from one format to another or to a common shared format, ensuring that data from various sources can be combined. These tools may be domain-specific or general-purpose. To meet wider user needs, we developed the data collection service, which supports the conversion of datasets into JSON, an easy-to-consume format, and their storage. We are developing a transformation pipeline that converts source metadata schemas and data formats into a semantic format based on the identified semantic core models and data standards.
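Mapping metadata elements to a common schema can be sketched as a simple crosswalk. The Python example below is an illustration under stated assumptions, not the RDC transformation pipeline: the provider names, the source field names and the sample records are invented, and the target field names are only loosely modelled on Schema.org property names.

```python
import json

# Hypothetical crosswalks: source field name -> field name in the common schema.
CROSSWALKS = {
    "provider_a": {"title": "name", "author": "creator", "created": "dateCreated"},
    "provider_b": {"Titel": "name", "Urheber": "creator", "Datum": "dateCreated"},
}


def harmonize(record, source):
    """Rename known fields to the common schema and collect unmapped ones."""
    crosswalk = CROSSWALKS[source]
    harmonized, unmapped = {}, {}
    for field, value in record.items():
        if field in crosswalk:
            harmonized[crosswalk[field]] = value
        else:
            unmapped[field] = value
    return harmonized, unmapped


# Made-up sample records from two sources with different metadata schemas.
record_a = {"title": "Bird counts 2021", "author": "A. Example", "created": "2021-05-01"}
record_b = {"Titel": "Moth survey", "Urheber": "B. Example", "Datum": "2020-09-15", "Ort": "Jena"}

harmonized_a, unmapped_a = harmonize(record_a, "provider_a")
harmonized_b, unmapped_b = harmonize(record_b, "provider_b")

# Serialize the harmonized records to JSON as a common exchange format.
print(json.dumps([harmonized_a, harmonized_b], indent=2, sort_keys=True))
```

Returning the unmapped fields separately, rather than dropping them silently, makes gaps in a crosswalk easy to spot and fix.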
Example Components
Perspectives
Data Integration
The mediation layer deals with data integration by facilitating the aggregation, harmonization, and sharing of data from multiple sources. This involves several mechanisms for harmonizing data from different sources and ensuring their consistency, such as the implemented tools for mapping and transforming data to a common format. Our primary goal for meaningful data integration is to ensure semantic interoperability between datasets. This will be based on the selected semantic schemas, at both the data and the metadata level, and on the terminologies that define and enforce common data semantics. One key RDC pilot under development deals with data integration into a Knowledge Graph (KG) that will be stored in a triple store or a graph database, enabling users to query initially heterogeneous data for research and analysis.
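The idea behind integrating records into a Knowledge Graph can be illustrated with plain subject-predicate-object triples. The sketch below uses a Python set instead of a real triple store or graph database, and the property IRIs, record fields and observations are hypothetical.

```python
# Hypothetical mapping from source field names to property IRIs.
PROPERTY_MAP = {
    "species": "http://example.org/prop/taxon",
    "site": "http://example.org/prop/location",
}


def to_triples(subject, record, property_map):
    """Turn the mapped fields of a record into (subject, predicate, object) triples."""
    return {
        (subject, property_map[field], value)
        for field, value in record.items()
        if field in property_map
    }


def match(graph, s=None, p=None, o=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Two made-up observations, imagined as coming from different sources.
graph = set()
graph |= to_triples(
    "http://example.org/obs/1",
    {"species": "Parus major", "site": "Jena"},
    PROPERTY_MAP,
)
graph |= to_triples(
    "http://example.org/obs/2",
    {"species": "Parus major", "site": "Leipzig", "note": "juvenile"},
    PROPERTY_MAP,
)

# One query now spans both sources.
great_tit_observations = [
    s for s, _, _ in match(graph, p=PROPERTY_MAP["species"], o="Parus major")
]
print(great_tit_observations)
```

Once both sources share the same predicates, a single pattern query covers data that originally arrived in different shapes. A production setup would typically use RDF and a query language such as SPARQL instead of this in-memory set.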
Curation and Harmonisation
Curation and harmonisation are achieved through metadata enrichment, that is, by annotating metadata with contextual information, controlled vocabularies and interlinks. This enhances data discoverability, comprehension and accuracy. We plan to implement tools for quality assessment to monitor and maintain data quality over time. This process will involve automatic checks, validation against standards, and user feedback mechanisms. Semantic Web technologies are based on formal semantics and provide a robust framework for data validation and consistency checks. They offer a set of reasoning mechanisms that can be applied to check data consistency. They also facilitate the development of validation rules to ensure that data conform to predefined standards and integrity constraints.
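Validation rules of this kind can be sketched as a small set of automatic checks. The Python example below is only an illustration: it uses plain functions rather than Semantic Web reasoning or a constraint language, and the field names, the controlled vocabulary and the sample records are assumptions.

```python
# Hypothetical controlled vocabulary for the "record_type" field.
ALLOWED_RECORD_TYPES = {"observation", "specimen"}

# Hypothetical list of fields that every record must provide.
REQUIRED_FIELDS = ("scientific_name", "record_type")


def validate(record):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []

    # Rule 1: required fields must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing required field: {field}")

    # Rule 2: values must come from the controlled vocabulary.
    record_type = record.get("record_type")
    if record_type and record_type not in ALLOWED_RECORD_TYPES:
        problems.append(f"record_type not in vocabulary: {record_type}")

    # Rule 3: integrity constraint on the numeric range of the latitude.
    latitude = record.get("latitude")
    if latitude is not None and not (-90 <= latitude <= 90):
        problems.append(f"latitude out of range: {latitude}")

    return problems


good = {"scientific_name": "Parus major", "record_type": "observation", "latitude": 50.9}
bad = {"scientific_name": "", "record_type": "guess", "latitude": 123.4}

print(validate(good))
print(validate(bad))
```

Collecting all problems instead of stopping at the first one gives curators a complete report per record, which suits the planned combination of automatic checks and user feedback.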