Achieving interoperability is essential to maximize the impact of the RDC by fostering collaboration and ensuring the reproducibility of research findings. The Mediation Layer provides community-agnostic functionality for creating high-quality data products. It includes tools and standards for ensuring that datasets from different sources and domains can be used collaboratively. This involves standard metadata and data formats, tools for managing metadata and for transforming data from a technical data model into a semantic model, as well as tools for improving data quality.


 

[Diagram: RDC-mediation-layer]

Zooming into the Mediation Layer

[Diagram: RDC-mediation-layer-detail]

Semantic Web

A terminology is a semantic resource that provides a common language and a framework for mapping and harmonizing data from diverse sources, enabling their transformation into high-quality data products. Terminologies help ensure that different datasets can be understood and efficiently integrated. Using domain-specific terminologies enables semantic interoperability, which means that different systems can communicate and exchange information with a shared understanding of the datasets' meaning. This is particularly important in heterogeneous environments and when integrating data from various sources. A key component of the RDC Mediation Layer is our terminology repository and service, BiodivPortal, which supports the management, sharing and use of biodiversity-related terminologies. BiodivPortal provides centralized storage for terminologies and offers various functionalities for their management through both its user interface and its API. Other components of the RDC, such as the GFBio Search and the Elasticsearch index, can perform semantic enrichment by reusing the offered semantic services or by directly accessing and integrating the provided terminologies.
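As a rough illustration of what semantic enrichment means in practice, the following sketch annotates free text with terms from a terminology. It does not call BiodivPortal or its API: the dictionary, the example.org IRIs and the `annotate` function are hypothetical stand-ins for a terminology that has been retrieved from the repository.

```python
import re

# Hypothetical in-memory excerpt of a terminology: label -> term IRI.
# The IRIs are placeholders, not real BiodivPortal identifiers.
TERMINOLOGY = {
    "leaf area": "https://example.org/terms/leaf_area",
    "plant height": "https://example.org/terms/plant_height",
    "seed mass": "https://example.org/terms/seed_mass",
}


def annotate(text, terminology):
    """Return (label, IRI, start offset) for every term label found in text."""
    matches = []
    for label, iri in terminology.items():
        # Whole-word, case-insensitive match of the label.
        pattern = r"\b" + re.escape(label) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            matches.append((label, iri, m.start()))
    # Report annotations in the order they occur in the text.
    return sorted(matches, key=lambda item: item[2])


annotations = annotate("Leaf area and seed mass of 20 grassland species", TERMINOLOGY)
for label, iri, offset in annotations:
    print(offset, label, iri)
```

A search index could store the matched IRIs next to the original text, so that a query for a term also finds records that use a different spelling of the same label.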

(Meta)data Standards

Establishing standardized data formats and metadata is a fundamental step to ensure that data shared within the RDC are structured in a consistent and understandable manner. Commonly used standards within the RDC include data exchange formats like JSON as well as metadata and data standards. Our goal is to build services based on well-established, internationally agreed-upon semantic standards spanning both data and metadata. More and more metadata standards are being offered in a Semantic Web-compliant format as ontologies. We identified an initial subset of semantic core models, namely Schema.org and ABCD (versions 2.06 and 2.1, considering ABCD3 in the future). We stored and published their corresponding ontologies in BiodivPortal. Additional standards for data, such as the Ecological Trait-data Standard (ETS), are being collected and similarly stored. Those standards, together with the identified set of terminologies, will be used as the basis for a consistent semantic (meta)data integration. To qualify as ETS compliant, a trait data product requires at minimum the following content:

  1. a value (column traitValue) and a standard unit (traitUnit);
  2. a trait name (traitName) that links to a standardized definition (from a standard terminology);
  3. the scientific taxon name (scientificName) that links to an accepted taxon from a standard terminology.

In the following, we illustrate the required data structure and the linking to standard terminologies using a data excerpt from the TRY Plant Trait Database. The required minimum columns have been renamed according to the ETS requirements. Standard units have been linked to the Quantities, Units, Dimensions and Data Types Ontology (QUDT), the trait names to the Thesaurus Of Plant characteristics (TOP), and the scientific names to the Catalogue of Life (COL).

[Image: data excerpt from the TRY Plant Trait Database in the ETS structure]
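The minimum content listed above can be sketched as a simple record check. The four column names come from the list. The link columns `traitID` and `taxonID`, the requirement that links are IRIs, and all example values are assumptions made for illustration, not a statement of how compliance is verified in the RDC.

```python
# Minimum columns named in the ETS requirements listed above.
REQUIRED_COLUMNS = ("traitValue", "traitUnit", "traitName", "scientificName")
# Assumed columns holding the links to standard terminologies.
LINK_COLUMNS = ("traitID", "taxonID")


def check_record(record):
    """Return a list of problems; an empty list means the minimum is met."""
    problems = []
    for column in REQUIRED_COLUMNS + LINK_COLUMNS:
        value = record.get(column)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {column}")
    for column in LINK_COLUMNS:
        value = str(record.get(column) or "")
        # Assumption: a link to a terminology is given as an http(s) IRI.
        if value and not value.startswith(("http://", "https://")):
            problems.append(f"{column} is not an IRI")
    return problems


# Hypothetical records; the IRIs are placeholders.
complete = {
    "scientificName": "Bellis perennis",
    "taxonID": "https://example.org/taxon/1",
    "traitName": "plant height",
    "traitID": "https://example.org/trait/1",
    "traitValue": 0.1,
    "traitUnit": "m",
}
incomplete = dict(complete, traitUnit="", taxonID="COL:123")

print(check_record(complete))
print(check_record(incomplete))
```

Returning a list of problems rather than a single boolean lets a curator see every gap in a record at once.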

(Meta)data Transformation

The mediation layer provides a set of transformation services and tools for data and metadata. Harmonizing metadata is critical for ensuring that different datasets can be effectively integrated. This involves mapping metadata elements from diverse sources to an agreed-upon common schema. Tools for data transformation and conversion are needed to convert (meta)data either from one format to another or to a common shared format, ensuring that data from various sources can be effectively combined. These tools can be either domain-specific or domain-agnostic. To meet wider user needs, we developed the data collection service, which supports the conversion and storage of datasets into JSON, an easy-to-consume format. We are developing a transformation pipeline that converts source metadata schemas and data formats into a semantic format based on the identified semantic core models and data standards. This transformation makes use of the BiodivPortal annotation service to incorporate standard terms from widely used terminologies, enhancing data discovery and interoperability. Furthermore, we are working on data transformations, e.g., for the conversion and unification of measurement units and coordinate reference systems of data points within individual datasets as well as across different datasets, which will improve the quality, usability, and interoperability of these datasets. We also plan to offer automatic transformations between different terminologies for terms occurring in data points, using mappings between the terminologies retrieved from BiodivPortal. This will help users understand and work with datasets that use unfamiliar terminology.
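The unification of measurement units mentioned above can be sketched as follows. This is a minimal example under stated assumptions: the conversion table, the choice of target units and the function name `harmonize_units` are hypothetical, and a real pipeline would more likely derive conversion factors from a units ontology such as QUDT.

```python
# Conversion to one target unit per quantity kind (here: metres and grams).
# Unit symbols and table layout are illustrative assumptions.
TO_STANDARD = {
    "mm": ("m", 0.001),
    "cm": ("m", 0.01),
    "m": ("m", 1.0),
    "mg": ("g", 0.001),
    "g": ("g", 1.0),
    "kg": ("g", 1000.0),
}


def harmonize_units(record):
    """Return a copy of the record with value and unit in the target unit."""
    unit = record["traitUnit"]
    if unit not in TO_STANDARD:
        # Fail loudly instead of passing through a value in an unknown unit.
        raise ValueError(f"no conversion known for unit {unit!r}")
    target_unit, factor = TO_STANDARD[unit]
    converted = dict(record)
    converted["traitValue"] = float(record["traitValue"]) * factor
    converted["traitUnit"] = target_unit
    return converted


# Two hypothetical measurements of the same trait from different datasets.
records = [
    {"traitName": "plant height", "traitValue": "25", "traitUnit": "cm"},
    {"traitName": "plant height", "traitValue": 0.3, "traitUnit": "m"},
]
harmonized = [harmonize_units(r) for r in records]
print(harmonized)
```

After this step, values of the same trait are comparable across datasets because they share one unit.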

Data Integration

The mediation layer deals with data integration by facilitating the aggregation, harmonization, and sharing of data from multiple sources. This involves several mechanisms for harmonizing data from different sources and ensuring their consistency, such as the implemented tools for mapping and transforming data to a common format. Our primary goal for meaningful data integration is to ensure semantic interoperability between datasets. This will be based on the set of selected semantic schemas, at both the data and the metadata level, as well as on the set of terminologies that define and enforce common data semantics. One key RDC pilot under development deals with data integration into a Knowledge Graph (KG) that will be stored in a triple store or a graph database, enabling users to query initially heterogeneous data for research and analysis purposes. A first version of a Knowledge Graph of trait data from the TRY Plant Trait Database is now available. A part of the graph is displayed below using GraphDB Free.

[Image: part of the trait-data Knowledge Graph displayed in GraphDB Free]
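To give an impression of how a trait record becomes part of a Knowledge Graph, the sketch below turns one record into subject-predicate-object triples, serializes them as N-Triples and answers a simple pattern query. The namespace, the property names and the record values are hypothetical; the actual graph model of the pilot is not described here, and a production setup would load the triples into a triple store and query them with SPARQL.

```python
# Hypothetical namespace; a real graph would reuse the vocabularies of the
# selected semantic core models and terminologies.
BASE = "https://example.org/kg/"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Render a plain string literal, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + text + '"'


def record_to_triples(record_id, record):
    """Turn one trait record into (subject, predicate, object) triples."""
    subject = iri(f"{BASE}measurement/{record_id}")
    return [
        (subject, iri(BASE + "ofTaxon"), iri(record["taxonID"])),
        (subject, iri(BASE + "ofTrait"), iri(record["traitID"])),
        (subject, iri(BASE + "value"), literal(record["traitValue"])),
        (subject, iri(BASE + "unit"), literal(record["traitUnit"])),
    ]


def to_ntriples(triples):
    """Serialize triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


record = {
    "taxonID": "https://example.org/taxon/1",
    "traitID": "https://example.org/trait/1",
    "traitValue": 0.1,
    "traitUnit": "m",
}
triples = record_to_triples("1", record)
print(to_ntriples(triples))
print(match(triples, p=iri(BASE + "unit")))
```

Because taxa and traits are identified by IRIs rather than by free-text names, records from different sources that point to the same IRI end up connected in the graph.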

Curation and Harmonisation

Curation and harmonisation are achieved through metadata enrichment, i.e., by annotating data with contextual information, controlled vocabularies and interlinks. This enhances data discoverability, comprehension and accuracy. We plan to implement tools for quality assessment to monitor and maintain data quality over time. This process will involve automatic checks, validation against standards, and user feedback mechanisms. Semantic Web technologies are based on formal semantics and provide a robust framework for data validation and consistency checks. They offer a set of reasoning mechanisms that can be applied to check data consistency. They also facilitate the development of validation rules to ensure that data conform to predefined standards and integrity constraints. Based on those, tools will be developed to support curation teams in their day-to-day work.


References

Schneider et al. Towards an ecological trait-data standard. Methods in Ecology and Evolution (2019). https://doi.org/10.1111/2041-210X.13288