Overview

The GFBio (German Federation for Biological Data) harvesting infrastructure gathers data daily from a variety of providers, primarily via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). This protocol is widely used for the interoperable exchange of metadata over the internet, which makes it the cornerstone of GFBio's data collection strategy. To cover the diverse nature of biological data, the infrastructure also handles special cases such as Access to Biological Collection Data (ABCD). ABCD is tailored to the exchange of information about biological collections and is a core component of biodiversity informatics.
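As a rough illustration of how an OAI-PMH harvest proceeds, the following Python sketch builds a ListRecords request and reads the resumption token used for paging. It is not the GFBio harvester itself: the provider URL, the sample response, and the function names are assumptions made for this example. Only the verb, the argument names, and the XML namespace come from the OAI-PMH specification.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a provider (base_url is hypothetical)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, next resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text
    token = token_el.text.strip() if has_token else None
    return identifiers, token


# Made-up response used only to demonstrate parsing; no network access needed.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
    <record><header><identifier>oai:example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = build_list_records_url("https://provider.example.org/oai",
                                  resumption_token=next_token)
```

A real harvester would repeat the request with each returned token until the provider sends an empty or missing resumptionToken element.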

To optimize data discovery, the GFBio infrastructure uses a metadata format known as "pansimple." This format is an enhanced version of the Dublin Core standard, developed by PANGAEA, an information system for earth and environmental science (https://www.pangaea.de/). Pansimple extends the basic Dublin Core metadata elements to offer advanced search capabilities without incorporating usage metadata, that is, data about how datasets are accessed and used. This deliberate design choice keeps the metadata structure focused on discovery.

The pansimple format is specifically engineered to support basic metadata elements with structured extensions. These extensions include geographical coordinates, which are crucial for locating biological data in a spatial context, and typed enumerations for faceting. Faceting refers to the process of organizing search results into categories based on metadata attributes, thereby enhancing the user's ability to filter and find relevant datasets. By incorporating these features, the pansimple format significantly improves the efficiency and precision of data discovery within the GFBio infrastructure, catering to the specific needs of the biological research community.
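To make the role of coordinates and facets more concrete, the sketch below assembles an Elasticsearch query body that combines a full-text query, a geographic bounding-box filter, and one terms aggregation per facet. The field names (`coordinates`, `dataCenterFacet`, `regionFacet`) are hypothetical and chosen for illustration only; the actual pansimple mapping may use different names. The query clauses themselves (`simple_query_string`, `geo_bounding_box`, `terms`) are standard Elasticsearch Query DSL.

```python
import json

# Hypothetical field names; the real pansimple mapping may differ.
GEO_FIELD = "coordinates"
FACET_FIELDS = ["dataCenterFacet", "regionFacet"]


def build_faceted_query(text, facet_fields, bbox=None, size=10):
    """Return an Elasticsearch search body with optional geo filter and facets.

    bbox is (west, south, east, north) in decimal degrees.
    """
    if text:
        query = {"simple_query_string": {"query": text}}
    else:
        query = {"match_all": {}}

    filters = []
    if bbox is not None:
        west, south, east, north = bbox
        filters.append({
            "geo_bounding_box": {
                GEO_FIELD: {
                    "top_left": {"lat": north, "lon": west},
                    "bottom_right": {"lat": south, "lon": east},
                }
            }
        })

    return {
        "size": size,
        "query": {"bool": {"must": [query], "filter": filters}},
        # One terms aggregation per facet yields the value counts shown
        # next to each filter option in a search interface.
        "aggs": {f: {"terms": {"field": f, "size": 20}} for f in facet_fields},
    }


body = build_faceted_query("quercus", FACET_FIELDS, bbox=(5.0, 47.0, 15.0, 55.0))
print(json.dumps(body, indent=2))
```

Placing the bounding box in the `filter` clause rather than in `must` keeps it out of relevance scoring, which is the usual choice for yes/no restrictions such as a map extent.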

In summary, the GFBio harvesting infrastructure represents a sophisticated approach to data collection and discovery in the field of biological research. By leveraging the OAI-PMH protocol, accommodating special cases with protocols like ABCD, and employing the pansimple metadata format, it ensures that researchers have access to a rich, easily navigable repository of biological data.

Status: PRODUCTIVE

Weblink: https://search.gfbio.org/

Target group: internal service

Keywords: search, semantic search

RDC Integration: integrated

Product owner: GFBio e.V.

Getting started

To start using the GFBio harvesting infrastructure, first install the components that form the foundation for its operation. This involves configuring two critical components on a server that meets specific hardware prerequisites. The required server configuration is relatively robust: approximately 14 CPU cores and 32 GB of RAM. This specification ensures that the server can handle the demands of processing and managing the extensive datasets typical in biological research.

The first component to be installed is Elasticsearch, a highly scalable open-source full-text search and analytics engine. Elasticsearch is instrumental in enabling the efficient storage, search, and analysis of large volumes of data in near real-time. Its role in the GFBio infrastructure is pivotal for supporting advanced data discovery and retrieval functionalities, making it easier for researchers to access and utilize biological data.

The second component is NGINX, which is employed as a proxy server. NGINX is renowned for its high performance, stability, rich feature set, simple configuration, and low resource consumption. In the context of the GFBio search infrastructure, NGINX acts as an intermediary for requests from clients seeking resources from the server. It enhances security, manages load balancing, and ensures efficient traffic handling, thereby contributing to the overall reliability and performance of the system.

User Guide

Accessing the Index: Overview

The Elasticsearch index, which contains metadata from various biological research data sources, is a critical asset of the GFBio infrastructure. To ensure security, data integrity, and efficient data retrieval, direct user access to this index is restricted. Instead, users interact with the index through higher-level interfaces that provide a user-friendly experience while preserving the structure of the underlying data.

Programmatic Access through an API

For users and developers looking to integrate GFBio data into their applications or workflows, the infrastructure offers programmatic access through an API. This API, a component of the search UI, allows for automated queries to the Elasticsearch index. It is designed for those who require a more flexible and customizable way to search and retrieve data, catering to specific research needs or application requirements.

The API provides a robust set of endpoints for querying the index, enabling users to specify search criteria, filter results, and access detailed metadata programmatically. This method is ideal for users who need to perform complex searches, automate data retrieval processes, or integrate GFBio data into third-party software or research tools.
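A client-side call to such an API might look like the following sketch. The endpoint URL and the payload keys (`queryString`, `from`, `size`) are assumptions for illustration, not the documented GFBio search API contract, so they should be checked against the actual API before use. The example only constructs the HTTP request and does not send it.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the real search API URL.
SEARCH_ENDPOINT = "https://search.example.org/api/search"


def build_search_request(text, page=0, page_size=10, endpoint=SEARCH_ENDPOINT):
    """Build (but do not send) a JSON POST request for one page of results."""
    payload = {
        "queryString": text,        # assumed name of the full-text parameter
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,          # number of hits per page
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("quercus robur", page=2, page_size=25)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and decoding the JSON response; that step is omitted here because the response format is not described in this document.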

Web-Based Application for User-Friendly Access

For users who prefer a more visual and interactive approach, the GFBio infrastructure offers a web-based application, also accessible through the search UI. This application provides a graphical interface for searching the Elasticsearch index, making it accessible to researchers, educators, and the general public without requiring technical expertise in API usage or query languages.

The web-based application simplifies the process of finding and accessing biological research data. Users can perform searches using intuitive forms and filters, browse search results, and view detailed metadata for individual records. This interface is designed to lower the barrier to accessing and utilizing the wealth of data available in the GFBio infrastructure, supporting a wide range of users from different backgrounds and with varying levels of technical proficiency.

RDC-Integration

Seamless Integration with the Research Data Commons

The Research Data Commons (RDC) is designed as a comprehensive ecosystem that facilitates the sharing, discovery, and reuse of research data across various scientific disciplines. At the heart of this ecosystem is the need for robust mechanisms that enable efficient search and retrieval of metadata, which is where the GFBio harvesting infrastructure plays a crucial role.

The Elasticsearch index, managed by the GFBio infrastructure, serves as a centralized repository of harvested documents, encompassing a wide array of metadata related to biological research. This index is made accessible through the GFBio search API, a powerful interface that allows for sophisticated querying capabilities. By leveraging this API, the integration with the RDC is realized, enabling users and systems within the RDC network to query and retrieve metadata stored in the Elasticsearch index.

Efficient Search and Retrieval of Metadata

The integration's core functionality is the efficient search and retrieval of metadata, which is critical for supporting the RDC's broader goals. The GFBio search API facilitates this by providing a flexible and powerful means to access the indexed data. Users can perform complex queries that can filter results based on various criteria, such as geographical location, species information, or any other metadata fields stored in the index. This capability ensures that researchers and other stakeholders can find the specific data they need quickly and efficiently.

Supporting Data Accessibility and Interoperability

The seamless integration between the GFBio harvesting infrastructure and the RDC through the search API significantly enhances data accessibility. By making it easier for researchers to find and access the data they need, this integration supports the broader scientific community's efforts to advance knowledge and foster innovation.

Moreover, interoperability is a key benefit of this integration. The use of standardized protocols and formats, such as those employed by the Elasticsearch index and the GFBio search API, ensures that data can be easily shared and used across different systems and applications within the RDC ecosystem. This interoperability is essential for collaborative research efforts, enabling data to be combined and analyzed in new and innovative ways.

Developer Guide

Development

While direct access to the index is not provided, development within the GFBio search ecosystem is facilitated through the GFBio search stack. This environment includes a dockerized test index linked to a Node Express backend and the GFBio search UI frontend, offering a comprehensive setup for developers to experiment with and enhance the search infrastructure. This stack is designed for ease of setup and use, allowing anyone to engage with and contribute to the development of the search infrastructure.

Maintaining the Elasticsearch Index

The Elasticsearch index, crucial for the operation of the GFBio infrastructure, is central to the storage and retrieval of harvested documents. Named "portals," this index aggregates all documents in a unified repository without segregating test indexes from production data. Such a configuration simplifies the architecture but necessitates diligent management to ensure data integrity and performance.

For effective maintenance of the Elasticsearch index, developers are advised to utilize diagnostic and management tools like iotop and curl. iotop is a command-line utility that provides real-time monitoring of disk I/O usage by processes, which is invaluable for identifying performance bottlenecks or excessive disk access that could indicate a need for optimization. On the other hand, curl is a versatile tool for transferring data with URLs, recommended for interacting with the Elasticsearch REST API for index management tasks such as querying index status, performing health checks, or initiating index operations.
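The kind of health check typically run with curl, for example `curl -s http://localhost:9200/_cluster/health/portals`, can also be scripted. The sketch below builds that URL and interprets a cluster health response. The host and port are assumptions (Elasticsearch's default local address), and treating "yellow" as acceptable is a policy choice made for this example rather than a GFBio rule.

```python
import json

ES_BASE = "http://localhost:9200"  # assumed default local Elasticsearch address
INDEX = "portals"


def health_url(base=ES_BASE, index=INDEX):
    """URL of the cluster health API, restricted to one index."""
    return f"{base}/_cluster/health/{index}"


def summarize_health(response_text):
    """Return (ok, message) for a cluster health JSON response.

    "green" and "yellow" count as ok; "red" means at least one primary
    shard is unassigned and needs attention.
    """
    health = json.loads(response_text)
    status = health["status"]
    unassigned = health.get("unassigned_shards", 0)
    ok = status in ("green", "yellow")
    return ok, f"status={status}, unassigned_shards={unassigned}"


# Made-up response for demonstration; a real one would come from health_url().
sample = '{"status": "yellow", "unassigned_shards": 3}'
print(health_url())
print(summarize_health(sample))
```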

Given the extensive data managed within the "portals" index, developers must be aware that maintenance and indexing operations can temporarily require at least 150 GiB of disk space. Adequate storage must therefore be provisioned for both the existing dataset and the additional space needed by operations that temporarily expand the index size.
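Before starting a maintenance or reindexing run, the free space can be checked with a few lines of Python. This is a minimal sketch: the 150 GiB threshold comes from the paragraph above, while the path to check is an assumption and should point at the volume that holds the Elasticsearch data directory.

```python
import shutil

GIB = 1024 ** 3
REQUIRED_FREE_BYTES = 150 * GIB  # temporary headroom named in this guide


def has_enough_space(free_bytes, required=REQUIRED_FREE_BYTES):
    """True if the free space meets the required headroom."""
    return free_bytes >= required


def check_path(path="."):
    """Return (ok, free space in GiB) for the volume containing path.

    The default "." is a placeholder; pass the Elasticsearch data path.
    """
    usage = shutil.disk_usage(path)
    return has_enough_space(usage.free), usage.free / GIB


ok, free_gib = check_path(".")
print(f"free: {free_gib:.1f} GiB, enough for maintenance: {ok}")
```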

References

Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. https://doi.org/10.1016/j.cageo.2008.02.023
