Info
The GFBio infrastructure centralizes biological data collection through an Elasticsearch index, enabling advanced real-time analysis and search capabilities. This index, pivotal for data discovery, is accessed via a user-friendly interface and an API, ensuring seamless integration with scientific databases. Focused on scalability and performance, it employs protocols like OAI-PMH for efficient data harvesting, supported by a robust server setup with Elasticsearch and NGINX. This approach ensures high accessibility and interoperability of biological data for researchers, facilitated by a structured "PanSimple" metadata format for precise data retrieval.

Overview

The GFBio (German Federation for Biological Data) harvesting infrastructure is designed to efficiently gather data on a daily basis from a variety of providers, primarily leveraging the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This protocol is widely recognized for its effectiveness in facilitating the interoperable exchange of metadata over the internet, making it a cornerstone of GFBio's data collection strategy. To ensure a comprehensive coverage that accommodates the diverse nature of biological data, the infrastructure is also equipped to handle other specialized protocols, such as Access To Biological Collection Data (ABCD). ABCD is specifically tailored for the exchange of information about biological collections, representing a critical component of biodiversity informatics.
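As a rough illustration of what an OAI-PMH harvesting step involves, the sketch below builds a `ListRecords` request URL and extracts record identifiers and the resumption token from a response. It is not the code used by the GFBio harvesters; the base URL, the `oai_dc` prefix choice, and the abbreviated sample response are assumptions made for the example. Only the verb, argument names, and XML namespace come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no other arguments
        # besides the verb may accompany it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from a response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty resumptionToken element marks the last page of results.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical, abbreviated response used only for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    ids, token = parse_list_records(SAMPLE_RESPONSE)
    print(ids, token)
    print(list_records_url("https://example.org/oai", resumption_token=token))
```

A harvester would repeat the request with each returned token until no token remains, which is how OAI-PMH pages through large result sets.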

To optimize data discovery, the GFBio infrastructure employs a dedicated metadata format known as "PanSimple". This format is an enhanced version of the Dublin Core standard, developed by PANGAEA (Data Publisher for Earth & Environmental Science), an information system for earth and environmental science (https://www.pangaea.de/). PanSimple extends the basic Dublin Core metadata elements to offer advanced search capabilities without incorporating usage metadata, that is, data about how datasets are accessed and used. This deliberate design choice keeps the metadata structure focused on discovery.

The PanSimple format is specifically engineered to support basic metadata elements with structured extensions. These extensions include geographical coordinates, which are crucial for locating biological data in a spatial context, and typed enumerations for faceting. Faceting refers to the process of organizing search results into categories based on metadata attributes, thereby enhancing the user's ability to filter and find relevant datasets. By incorporating these features, the PanSimple format improves the efficiency and precision of data discovery within the GFBio infrastructure, catering to the specific needs of the biological research community.
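The two structured extensions mentioned above, coordinates and enumerations for faceting, can be illustrated with a small in-memory sketch. The records and field names (`dataCenter`, `coordinates`) are hypothetical and do not reproduce the actual PanSimple schema; in the real infrastructure, faceting and spatial filtering are performed by the search index rather than in application code.

```python
from collections import Counter

# Hypothetical PanSimple-like records; field names are illustrative
# assumptions, not the actual PanSimple schema.
RECORDS = [
    {"title": "Bird survey 2019", "dataCenter": "Center A",
     "coordinates": {"lat": 53.1, "lon": 8.8}},
    {"title": "Soil fungi samples", "dataCenter": "Center B",
     "coordinates": {"lat": 48.1, "lon": 11.6}},
    {"title": "Bird survey 2020", "dataCenter": "Center A",
     "coordinates": {"lat": 53.2, "lon": 8.9}},
]


def facet_counts(records, field):
    """Count how many records fall into each value of an enumerated field."""
    return Counter(r[field] for r in records if field in r)


def in_bounding_box(record, south, west, north, east):
    """Check whether a record's coordinates lie inside a lat/lon box."""
    c = record["coordinates"]
    return south <= c["lat"] <= north and west <= c["lon"] <= east


if __name__ == "__main__":
    print(facet_counts(RECORDS, "dataCenter"))
    print([r["title"] for r in RECORDS if in_bounding_box(r, 52, 8, 54, 10)])
```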

In summary, the GFBio harvesting infrastructure combines three elements for data collection and discovery in biological research: the OAI-PMH protocol as the primary harvesting mechanism, additional protocols such as ABCD for special cases, and the PanSimple metadata format for discovery. Together they give researchers access to a rich, easily navigable repository of biological data.

Image Added
Fig. 1: The diagram showcases the architecture of the search index and harvesting infrastructure. At its core is the Elasticsearch index, which aggregates metadata from various sources. The metadata is collected by specific harvesters, including panFMP and BMS-Harvester, which respectively gather data from OAI-PMH providers and from BioCASe providers through the BioCASe Monitor Service (BMS). On the user-facing side, the Node Express backend serves the GFBio Search application, providing access to the indexed data via a public API. The diagram highlights the general flow of data from gathering to search interface and the central role of Elasticsearch in search and retrieval. Note: This is not a technical documentation diagram, and the connections deliberately do not reflect every data flow between the applications.

LOGO (work in progress)

Status: Productive

Weblink: https://search.gfbio.org/

Target group: internal service 

Keywords: search, semantic search

RDC Integration: integrated

Product owner: GFBio e.V.




Getting started

To start using the GFBio harvesting infrastructure, first install its two core components on a server that meets the hardware prerequisites: approximately 14 CPU cores and 32 GB of RAM. This specification ensures that the server can handle the processing and management of the extensive datasets typical in biological research.

The first component to be installed is Elasticsearch, a highly scalable open-source full-text search and analytics engine. Elasticsearch is instrumental in enabling the efficient storage, search, and analysis of large volumes of data in near real-time. Its role in the GFBio infrastructure is pivotal for supporting advanced data discovery and retrieval functionalities, making it easier for researchers to access and utilize biological data.

The second component is NGINX, which is employed as a proxy server. NGINX is renowned for its high performance, stability, rich feature set, simple configuration, and low resource consumption. In the context of the GFBio search infrastructure, NGINX acts as an intermediary for requests from clients seeking resources from the server. It enhances security, manages load balancing, and ensures efficient traffic handling, thereby contributing to the overall reliability and performance of the system.

User Guide

Accessing the Index: Overview

The Elasticsearch index, containing a wealth of metadata from various biological research data sources, is a critical asset of the GFBio infrastructure. However, to ensure security, data integrity, and efficient data retrieval, direct user access to this index is restricted. Users instead interact with the index through higher-level interfaces that offer a user-friendly experience without exposing the complexity of the underlying data structure.

Programmatic Access through an API

For users and developers looking to integrate GFBio data into their applications or workflows, the infrastructure offers programmatic access through an API. This API, a component of the search UI, allows for automated queries to the Elasticsearch index. It is designed for those who require a more flexible and customizable way to search and retrieve data, catering to specific research needs or application requirements.

The API provides a robust set of endpoints for querying the index, enabling users to specify search criteria, filter results, and access detailed metadata programmatically. This method is ideal for users who need to perform complex searches, automate data retrieval processes, or integrate GFBio data into third-party software or research tools.

Web-Based Application for User-Friendly Access

For users who prefer a more visual and interactive approach, the GFBio infrastructure offers a web-based application, also accessible through the search UI. This application provides a graphical interface for searching the Elasticsearch index, making it accessible to researchers, educators, and the general public without requiring technical expertise in API usage or query languages.

The web-based application simplifies the process of finding and accessing biological research data. Users can perform searches using intuitive forms and filters, browse search results, and view detailed metadata for individual records. This interface is designed to lower the barrier to accessing and utilizing the wealth of data available in the GFBio infrastructure, supporting a wide range of users from different backgrounds and with varying levels of technical proficiency.

RDC-Integration

Seamless Integration with the Research Data Commons

The Research Data Commons (RDC) is designed as a comprehensive ecosystem that facilitates the sharing, discovery, and reuse of research data across various scientific disciplines. At the heart of this ecosystem is the need for robust mechanisms that enable efficient search and retrieval of metadata, which is where the GFBio harvesting infrastructure plays a crucial role.

The Elasticsearch index, managed by the GFBio infrastructure, serves as a centralized repository of harvested documents, encompassing a wide array of metadata related to biological research. This index is made accessible through the GFBio search API, a powerful interface that allows for sophisticated querying capabilities. By leveraging this API, the integration with the RDC is realized, enabling users and systems within the RDC network to query and retrieve metadata stored in the Elasticsearch index.

Efficient Search and Retrieval of Metadata

The integration's core functionality is the efficient search and retrieval of metadata, which is critical for supporting the RDC's broader goals. The GFBio search API facilitates this by providing a flexible and powerful means to access the indexed data. Users can perform complex queries that can filter results based on various criteria, such as geographical location, species information, or any other metadata fields stored in the index. This capability ensures that researchers and other stakeholders can find the specific data they need quickly and efficiently.
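To make the idea of combining a text search with a geographical filter concrete, the sketch below builds a request body in Elasticsearch Query DSL. It is an illustration only: whether the GFBio search API accepts a body of this shape is an assumption, and `coordinates` is a hypothetical name for a geo-point field. The `bool`, `simple_query_string`, and `geo_bounding_box` constructs themselves are standard Query DSL.

```python
import json


def build_search_body(text, bbox=None, size=10):
    """Build a Query DSL body: full-text match plus an optional geo filter.

    bbox is (north, west, south, east) in decimal degrees.
    """
    bool_query = {
        "must": [{"simple_query_string": {"query": text}}],
        "filter": [],
    }
    if bbox is not None:
        north, west, south, east = bbox
        bool_query["filter"].append({
            "geo_bounding_box": {
                # "coordinates" is a hypothetical geo_point field name.
                "coordinates": {
                    "top_left": {"lat": north, "lon": west},
                    "bottom_right": {"lat": south, "lon": east},
                }
            }
        })
    return {"size": size, "query": {"bool": bool_query}}


if __name__ == "__main__":
    body = build_search_body("Nerodia erythrogaster", bbox=(55, 5, 47, 15))
    print(json.dumps(body, indent=2))
```

Placing the spatial condition under `filter` rather than `must` keeps it out of relevance scoring, so it only narrows the result set.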

Supporting Data Accessibility and Interoperability

The seamless integration between the GFBio harvesting infrastructure and the RDC through the search API significantly enhances data accessibility. By making it easier for researchers to find and access the data they need, this integration supports the broader scientific community's efforts to advance knowledge and foster innovation.

Moreover, interoperability is a key benefit of this integration. The use of standardized protocols and formats, such as those employed by the Elasticsearch index and the GFBio search API, ensures that data can be easily shared and used across different systems and applications within the RDC ecosystem. This interoperability is essential for collaborative research efforts, enabling data to be combined and analyzed in new and innovative ways.

Developer Guide

Development

While direct access to the index is not provided, development within the GFBio search ecosystem is facilitated through the GFBio search stack. This environment includes a dockerized test index linked to a Node Express backend and the GFBio search UI frontend, offering a comprehensive setup for developers to experiment with and enhance the search infrastructure. This stack is designed for ease of setup and use, allowing anyone to engage with and contribute to the development of the search infrastructure.

Maintaining the Elasticsearch Index

The Elasticsearch index, crucial for the operation of the GFBio infrastructure, is central to the storage and retrieval of harvested documents. Named "portals," this index aggregates all documents in a unified repository without segregating test indexes from production data. Such a configuration simplifies the architecture but necessitates diligent management to ensure data integrity and performance.

For effective maintenance of the Elasticsearch index, developers are advised to utilize diagnostic and management tools like iotop and curl. iotop is a command-line utility that provides real-time monitoring of disk I/O usage by processes, which is invaluable for identifying performance bottlenecks or excessive disk access that could indicate a need for optimization. On the other hand, curl is a versatile tool for transferring data with URLs, recommended for interacting with the Elasticsearch REST API for index management tasks such as querying index status, performing health checks, or initiating index operations.
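The kind of health check typically issued with curl can also be scripted. The following sketch queries the Elasticsearch cluster health endpoint for the "portals" index and condenses the response. The base URL passed to `fetch_health` is an assumption (a local node on the default port is used in the comment), and treating "yellow" as acceptable is a choice made for this example, not a documented GFBio policy.

```python
import json
from urllib.request import urlopen


def cluster_health_url(base_url, index="portals"):
    """URL of the cluster health endpoint, restricted to one index."""
    return f"{base_url.rstrip('/')}/_cluster/health/{index}"


def summarize_health(payload):
    """Condense a cluster health response into the fields of interest."""
    status = payload.get("status", "unknown")
    return {
        "status": status,
        # green: all shards allocated; yellow: primaries allocated but
        # some replicas are not; red: at least one primary is missing.
        "ok": status in ("green", "yellow"),
        "unassigned_shards": payload.get("unassigned_shards", 0),
    }


def fetch_health(base_url, index="portals", timeout=10):
    """Fetch and summarize index health, e.g. from http://localhost:9200."""
    with urlopen(cluster_health_url(base_url, index), timeout=timeout) as resp:
        return summarize_health(json.load(resp))


if __name__ == "__main__":
    # Offline demonstration with a hand-written payload.
    print(summarize_health({"status": "yellow", "unassigned_shards": 3}))
```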

Given the extensive data managed within the "portals" index, developers must be aware that maintenance and indexing operations can require significant disk space, temporarily needing at least 150 GiB. This requirement underscores the importance of provisioning adequate storage resources to accommodate both the existing dataset and additional space for operations that may expand the index size.
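Before starting a maintenance or reindexing run, the disk space requirement can be verified with a short check. This is a minimal sketch: the 150 GiB threshold comes from the paragraph above, while the data path in the example is a placeholder that would need to point at the volume holding the Elasticsearch data.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_headroom(free_bytes, required_gib=150):
    """True if the free space covers the temporary maintenance overhead."""
    return free_bytes >= required_gib * GIB


def check_data_path(path, required_gib=150):
    """Check free space on the volume that contains `path`."""
    usage = shutil.disk_usage(path)
    return has_headroom(usage.free, required_gib)


if __name__ == "__main__":
    # "." is a placeholder; use the Elasticsearch data directory instead.
    print("enough space:", check_data_path("."))
```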

Perspectives

To enable semantic search, we plan to expand the search index with information from the terminologies available in BiodivPortal, our terminology repository and service. This process involves identifying synonyms and alternative labels for concepts in the terminologies of interest and including them in the index, which helps capture the different keywords users might choose to express the same concept. Additional information to be considered is the structure of the ontology, including hierarchical relationships (e.g. broader/narrower terms) and associative relationships (e.g. "related-to", "part-of"). The expanded terms will then be integrated during indexing. This step could involve updating the index structure, the metadata schema, or the indexing pipeline to accommodate the additional information. We plan an evaluation and refinement phase in which we will monitor search performance metrics, relevance feedback, and user satisfaction to refine the index expansion strategy over time.

The expansion process will rely on a lookup index modeled on the PANGAEA Term Index structure. This second Elasticsearch index is prefilled with information from key terminologies we deem pertinent for search. We extract terms directly from concept labels, synonyms, and definitions, as well as from hierarchical information, as these proved effective in the PANGAEA context.

The following table shows the structure of the lookup index illustrated with an example coming from ITIS:

key | value | example
label | The label of the term | Nerodia erythrogaster
sourceTerminology | Name of the source terminology (acronym) | ITIS
uri | Unique identifier | http://terminologies.gfbio.org/ITIS/Taxa_174244
description | Plain text description |
synonyms | JSON array of synonyms | ["Nerodia erythrogaster alta", "Nerodia erythrogaster bogerti"]
broaders | JSON array of broader terms | ["Nerodia", "Chordata"]
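The following sketch shows how an entry with the structure from the table above could drive query expansion. The entry reproduces the ITIS example; the `expand_term` function and its matching strategy (case-insensitive exact match on label or synonym) are assumptions made for illustration and are not taken from the lookup index code.

```python
# Lookup entry mirroring the ITIS example from the table above.
LOOKUP_ENTRIES = [
    {
        "label": "Nerodia erythrogaster",
        "sourceTerminology": "ITIS",
        "uri": "http://terminologies.gfbio.org/ITIS/Taxa_174244",
        "description": "",
        "synonyms": [
            "Nerodia erythrogaster alta",
            "Nerodia erythrogaster bogerti",
        ],
        "broaders": ["Nerodia", "Chordata"],
    }
]


def expand_term(term, entries, include_broaders=False):
    """Return the term plus labels and synonyms of matching entries.

    Hypothetical helper: matches case-insensitively on label or synonym
    and skips duplicates while preserving order.
    """
    expanded = [term]
    needle = term.casefold()
    for entry in entries:
        names = [entry["label"], *entry.get("synonyms", [])]
        if needle not in (n.casefold() for n in names):
            continue
        candidates = list(names)
        if include_broaders:
            candidates += entry.get("broaders", [])
        for candidate in candidates:
            if candidate.casefold() not in (e.casefold() for e in expanded):
                expanded.append(candidate)
    return expanded


if __name__ == "__main__":
    print(expand_term("Nerodia erythrogaster", LOOKUP_ENTRIES, include_broaders=True))
```

Including broader terms widens recall at the cost of precision, which is why it is left as an explicit option in this sketch.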

The code that generates the lookup index is available on GitHub: https://github.com/biodivportal/lookup_index. An instance of the index is running on our cloud infrastructure and can be accessed by different RDC components.

References

Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. https://doi.org/10.1016/j.cageo.2008.02.023


Do you have questions, feedback or need help?

Contact our Helpdesk for direct support.

