The GFBio (German Federation for Biological Data) harvesting infrastructure gathers data daily from a variety of providers, primarily via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH is widely used for the interoperable exchange of metadata over the internet, making it the cornerstone of GFBio's data collection strategy. To cover the diverse nature of biological data, the infrastructure also handles other, specialized protocols such as Access To Biological Collection Data (ABCD), which is tailored to the exchange of information about biological collections and is a critical component of biodiversity informatics.
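As a sketch of how an OAI-PMH harvest begins, the snippet below builds a ListRecords request and extracts record identifiers from a response. The endpoint URL is a placeholder, and the XML fragment is a minimal illustration of a provider response, not actual GFBio data:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical OAI-PMH endpoint; real provider URLs come from the harvester configuration.
BASE_URL = "https://example.org/oai"

def list_records_url(metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL per the OAI-PMH specification."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # A resumptionToken must be the only argument besides the verb.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return BASE_URL + "?" + urlencode(params)

print(list_records_url())

# Minimal response fragment, parsed to extract record identifiers:
sample = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:1</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""
ids = [e.text for e in
       ET.fromstring(sample).iter("{http://www.openarchives.org/OAI/2.0/}identifier")]
print(ids)
```

A harvester keeps issuing ListRecords requests, passing back the resumptionToken from each response, until the provider returns no further token.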
To optimize data discovery, the GFBio infrastructure employs a metadata format known as "pansimple." This format is an enhanced version of the Dublin Core standard, developed by PANGAEA, an information system for earth and environmental science. Pansimple extends the basic Dublin Core elements to offer advanced search capabilities while deliberately omitting usage metadata (data about how datasets are accessed and used), keeping the metadata focused on discovery and identification rather than on tracking user interactions.
The pansimple format is specifically engineered to support basic metadata elements with structured extensions. These extensions include geographical coordinates, which are crucial for locating biological data in a spatial context, and typed enumerations for faceting. Faceting refers to the process of organizing search results into categories based on metadata attributes, thereby enhancing the user's ability to filter and find relevant datasets. By incorporating these features, the pansimple format significantly improves the efficiency and precision of data discovery within the GFBio infrastructure, catering to the specific needs of the biological research community.
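As a rough sketch of the idea (the element names below are illustrative, not the actual pansimple schema; consult the PANGAEA documentation for the real one), such a record combines plain Dublin Core-style elements with a structured geographic extension and a typed value usable as a search facet:

```python
import xml.etree.ElementTree as ET

# All tag names here are hypothetical stand-ins for the real pansimple schema.
record = ET.Element("record")
for tag, text in [("title", "Plankton counts, North Sea"), ("creator", "Example, A.")]:
    ET.SubElement(record, tag).text = text  # plain Dublin Core-style elements

# Structured extension: a bounding box so the record can be found by spatial search.
geo = ET.SubElement(record, "geo")
for tag, val in [("minLatitude", "53.5"), ("maxLatitude", "55.0"),
                 ("minLongitude", "3.0"), ("maxLongitude", "8.0")]:
    ET.SubElement(geo, tag).text = val

# Typed enumeration: a controlled value that search results can be faceted on.
ET.SubElement(record, "dataCenter", attrib={"type": "facet"}).text = "PANGAEA"

print(ET.tostring(record, encoding="unicode"))
```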
In summary, the GFBio harvesting infrastructure represents a sophisticated approach to data collection and discovery in the field of biological research. By leveraging the OAI-PMH protocol, accommodating special cases with protocols like ABCD, and employing the pansimple metadata format, it ensures that researchers have access to a rich, easily navigable repository of biological data.
Getting started
To get started with the GFBio harvesting infrastructure, the first step is to install the components that form the foundation of its operation: two critical services configured on a server that meets specific hardware prerequisites. The required server configuration is relatively robust, with approximately 14 CPU cores and 32 GB of RAM, so that it can handle the processing and management of the extensive datasets typical in biological research.
The first component to be installed is Elasticsearch, a highly scalable open-source full-text search and analytics engine. Elasticsearch is instrumental in enabling the efficient storage, search, and analysis of large volumes of data in near real-time. Its role in the GFBio infrastructure is pivotal for supporting advanced data discovery and retrieval functionalities, making it easier for researchers to access and utilize biological data.
The second component is NGINX, which is employed as a proxy server. NGINX is renowned for its high performance, stability, rich feature set, simple configuration, and low resource consumption. In the context of the GFBio infrastructure, NGINX acts as an intermediary for requests from clients seeking resources from the server. It enhances security, manages load balancing, and ensures efficient traffic handling, thereby contributing to the overall reliability and performance of the system.
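Once both services are running, a quick sanity check is to query Elasticsearch's cluster-health endpoint (directly, or through the NGINX proxy). The sketch below assumes the default local port 9200; the helpers build the request and interpret a health response:

```python
import json
from urllib.request import Request

# Assumed local endpoint; adjust host/port to match your deployment or proxy.
ES_URL = "http://localhost:9200"

def health_request(es_url=ES_URL):
    """Build a cluster-health request; send it with urllib.request.urlopen once the server is up."""
    return Request(es_url + "/_cluster/health", headers={"Accept": "application/json"})

def is_ready(health_json: str) -> bool:
    """True when the cluster reports green or yellow status (red means shards are unassigned)."""
    return json.loads(health_json).get("status") in ("green", "yellow")

# Example response in the shape Elasticsearch returns:
print(is_ready('{"cluster_name": "gfbio", "status": "green"}'))
```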
For those looking to implement this setup, detailed installation instructions are readily available in the GitLab repository dedicated to the GFBio harvesting project. This repository (https://gitlab.gwdg.de/gfbio/harvesting) serves as a comprehensive resource, offering step-by-step guidance, configuration details, and best practices for setting up the infrastructure. It is designed to assist users, ranging from system administrators to researchers, in successfully deploying the GFBio harvesting infrastructure on their servers.
By following the instructions provided in the GitLab repository, users can ensure a smooth and efficient setup process, laying a solid foundation for leveraging the GFBio infrastructure's capabilities. This setup is crucial for facilitating the aggregation, search, and analysis of biological data, ultimately supporting the broader objectives of biological research and conservation efforts.
User Guide
For maintaining the Elasticsearch index, tools such as iotop and curl are recommended. The index, named "portals," contains all harvested documents; there is no differentiation between test indexes. Maintenance and indexing operations may temporarily require up to 150 GiB of disk space.
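For example, the current store size of the "portals" index can be read from the text output of `curl -s 'localhost:9200/_cat/indices/portals'`. The small parser below assumes the default column order of the _cat/indices output (health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size); the sample line is invented for illustration:

```python
# Sample line in the default _cat/indices column order (values are made up):
sample = "green open portals abc123 5 1 1200000 0 120.4gb 60.2gb"

def store_size(cat_line: str) -> str:
    """Return the store.size column (9th field) from a _cat/indices line."""
    return cat_line.split()[8]

print(store_size(sample))
```

Watching this value during reindexing, together with iotop for disk I/O, helps confirm that the temporary space requirement stays within bounds.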
RDC-Integration
The Elasticsearch index is accessible via the GFBio search API, providing a seamless integration with the Research Data Commons. This setup allows for the efficient search and retrieval of metadata, supporting the broader goals of data accessibility and interoperability within the RDC.
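A search against the index typically combines a full-text query with a terms aggregation for faceting. The field names in this sketch (`fulltext`, `dataCenter`) are assumptions for illustration, not the actual index mapping:

```python
import json

def build_search(term: str, facet_field: str = "dataCenter", size: int = 10) -> str:
    """Build an Elasticsearch query body: full-text match plus a terms
    aggregation that yields facet counts alongside the hits.
    Field names are placeholders; the real mapping is defined by the harvester."""
    body = {
        "size": size,
        "query": {"match": {"fulltext": term}},
        "aggs": {facet_field: {"terms": {"field": facet_field}}},
    }
    return json.dumps(body)

# POST this body to <search-endpoint>/portals/_search (or via the GFBio search API).
print(build_search("plankton"))
```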
Developer Guide
Developers looking to work with Elasticsearch and NGINX within the GFBio infrastructure should consult the configuration files and plugin JARs available in the GitLab repository. For advanced configurations, such as setting up additional data nodes or securing SSH access, developers should follow the guidelines provided in the documentation and consider network security implications.