Getting Started
To start using the GFBio harvesting infrastructure, first install the two components it runs on. Both are configured on a server that meets specific hardware prerequisites: approximately 14 CPU cores and 32 GB of RAM. This capacity ensures that the server can process and manage the extensive datasets typical of biological research.
The first component to be installed is Elasticsearch, a highly scalable open-source full-text search and analytics engine. Elasticsearch is instrumental in enabling the efficient storage, search, and analysis of large volumes of data in near real-time. Its role in the GFBio infrastructure is pivotal for supporting advanced data discovery and retrieval functionalities, making it easier for researchers to access and utilize biological data.
The second component is NGINX, which is employed as a proxy server. NGINX is renowned for its high performance, stability, rich feature set, simple configuration, and low resource consumption. In the context of the GFBio search infrastructure, NGINX acts as an intermediary for requests from clients seeking resources from the server. It enhances security, manages load balancing, and ensures efficient traffic handling, thereby contributing to the overall reliability and performance of the system.
User Guide
Accessing the Index: Overview
The Elasticsearch index, which contains metadata from various biological research data sources, is a critical asset of the GFBio infrastructure. To protect security, data integrity, and retrieval efficiency, direct user access to this index is restricted. Users instead interact with the index through higher-level interfaces that provide a user-friendly experience while preserving the richness and structure of the underlying data.
Programmatic Access through an API
For users and developers looking to integrate GFBio data into their applications or workflows, the infrastructure offers programmatic access through an API. This API, a component of the search UI, allows for automated queries to the Elasticsearch index. It is designed for those who require a more flexible and customizable way to search and retrieve data, catering to specific research needs or application requirements.
The API provides a robust set of endpoints for querying the index, enabling users to specify search criteria, filter results, and access detailed metadata programmatically. This method is ideal for users who need to perform complex searches, automate data retrieval processes, or integrate GFBio data into third-party software or research tools.
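As a rough illustration of such a programmatic query, the sketch below assembles a request body in the Elasticsearch query DSL, on the assumption that the API forwards or translates requests to the Elasticsearch index. This guide does not document the API's endpoint or request schema, so the function name `build_search_body`, the pagination defaults, and the shape of the body are hypothetical and may differ from what the real API expects.

```python
import json


def build_search_body(text, page=0, page_size=10):
    """Assemble an Elasticsearch-style full-text query body.

    Assumption: the GFBio search API accepts (or translates to) a
    query-DSL body like this one; the real request schema may differ.
    """
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        # simple_query_string tolerates malformed user input instead of failing
        "query": {"simple_query_string": {"query": text}},
        "from": page * page_size,  # offset of the first hit to return
        "size": page_size,         # number of hits per page
    }


if __name__ == "__main__":
    # Print the body for the third page of results (pages are zero-based).
    print(json.dumps(build_search_body("Quercus robur", page=2), indent=2))
```

Keeping the body construction separate from the HTTP call makes it easy to inspect or log the query before sending it to whichever endpoint the API actually exposes.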
Web-Based Application for User-Friendly Access
For users who prefer a more visual and interactive approach, the GFBio infrastructure offers a web-based application, also accessible through the search UI. This application provides a graphical interface for searching the Elasticsearch index, making it accessible to researchers, educators, and the general public without requiring technical expertise in API usage or query languages.
The web-based application simplifies the process of finding and accessing biological research data. Users can perform searches using intuitive forms and filters, browse search results, and view detailed metadata for individual records. This interface is designed to lower the barrier to accessing and utilizing the wealth of data available in the GFBio infrastructure, supporting a wide range of users from different backgrounds and with varying levels of technical proficiency.
RDC Integration
Seamless Integration with the Research Data Commons
The Research Data Commons (RDC) is designed as a comprehensive ecosystem that facilitates the sharing, discovery, and reuse of research data across various scientific disciplines. At the heart of this ecosystem is the need for robust mechanisms that enable efficient search and retrieval of metadata, which is where the GFBio harvesting infrastructure plays a crucial role.
The Elasticsearch index, managed by the GFBio infrastructure, serves as a centralized repository of harvested documents, encompassing a wide array of metadata related to biological research. This index is made accessible through the GFBio search API, a powerful interface that allows for sophisticated querying capabilities. By leveraging this API, the integration with the RDC is realized, enabling users and systems within the RDC network to query and retrieve metadata stored in the Elasticsearch index.
Efficient Search and Retrieval of Metadata
The integration's core functionality is the efficient search and retrieval of metadata, which is critical for supporting the RDC's broader goals. The GFBio search API facilitates this by providing a flexible and powerful means to access the indexed data. Users can perform complex queries that can filter results based on various criteria, such as geographical location, species information, or any other metadata fields stored in the index. This capability ensures that researchers and other stakeholders can find the specific data they need quickly and efficiently.
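A filtered query of this kind might look like the following sketch, which combines a full-text match with an exact-match filter and a geographic bounding box using the Elasticsearch `bool` query. The field names `taxon` and `location` are hypothetical placeholders, as is the assumption that the search API accepts a query-DSL body; the actual field names depend on the index mapping, which this guide does not describe.

```python
def build_filtered_query(text, taxon=None, bbox=None,
                         taxon_field="taxon", geo_field="location"):
    """Build a bool query: scored full-text match plus non-scoring filters.

    `taxon_field` and `geo_field` are hypothetical names; substitute the
    fields actually present in the index mapping.
    `bbox` is (north, west, south, east) in decimal degrees.
    """
    filters = []
    if taxon is not None:
        # term filter: exact match, intended for keyword-type fields
        filters.append({"term": {taxon_field: taxon}})
    if bbox is not None:
        north, west, south, east = bbox
        # geo_bounding_box requires the field to be mapped as a geo type
        filters.append({
            "geo_bounding_box": {
                geo_field: {
                    "top_left": {"lat": north, "lon": west},
                    "bottom_right": {"lat": south, "lon": east},
                }
            }
        })
    return {
        "query": {
            "bool": {
                "must": [{"simple_query_string": {"query": text}}],
                "filter": filters,  # filters narrow results without affecting scores
            }
        }
    }


if __name__ == "__main__":
    # Example: records mentioning "oak" for one taxon inside a rough bounding box.
    print(build_filtered_query("oak", taxon="Quercus", bbox=(55.0, 5.9, 47.3, 15.0)))
```

Placing the criteria under `filter` rather than `must` lets Elasticsearch cache them and keeps relevance scoring driven by the text match alone.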
Supporting Data Accessibility and Interoperability
The seamless integration between the GFBio harvesting infrastructure and the RDC through the search API significantly enhances data accessibility. By making it easier for researchers to find and access the data they need, this integration supports the broader scientific community's efforts to advance knowledge and foster innovation.
Moreover, interoperability is a key benefit of this integration. The use of standardized protocols and formats, such as those employed by the Elasticsearch index and the GFBio search API, ensures that data can be easily shared and used across different systems and applications within the RDC ecosystem. This interoperability is essential for collaborative research efforts, enabling data to be combined and analyzed in new and innovative ways.
Developer Guide
Development
While direct access to the index is not provided, development within the GFBio search ecosystem is facilitated through the GFBio search stack. This environment includes a dockerized test index linked to a Node Express backend and the GFBio search UI frontend, offering a comprehensive setup for developers to experiment with and enhance the search infrastructure. This stack is designed for ease of setup and use, allowing anyone to engage with and contribute to the development of the search infrastructure.
Maintaining the Elasticsearch Index
The Elasticsearch index, crucial for the operation of the GFBio infrastructure, is central to the storage and retrieval of harvested documents. Named "portals," this index aggregates all documents in a unified repository without segregating test indexes from production data. Such a configuration simplifies the architecture but necessitates diligent management to ensure data integrity and performance.
To maintain the Elasticsearch index, developers are advised to use diagnostic and management tools such as iotop and curl. iotop is a command-line utility that monitors disk I/O usage per process in real time, which helps identify performance bottlenecks or excessive disk access that may call for optimization. curl is a command-line tool for transferring data to and from URLs; it is recommended for interacting with the Elasticsearch REST API for index management tasks such as querying index status, performing health checks, or initiating index operations.
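A health check against the Elasticsearch REST API can be sketched as follows. The example is written in Python rather than curl, but it issues the same GET request to the standard `_cluster/health` endpoint. The base URL `http://localhost:9200` is an assumption (9200 is the Elasticsearch default port); the helper names are hypothetical, and the request is assumed to run on the server itself, since direct access to the index is restricted.

```python
import json
import urllib.request


def health_url(base_url, index="portals"):
    """URL of the cluster health endpoint, scoped to one index."""
    return f"{base_url.rstrip('/')}/_cluster/health/{index}"


def summarize_health(payload):
    """Reduce a cluster-health response to the fields worth watching."""
    status = payload.get("status", "unknown")
    return {
        "status": status,
        "unassigned_shards": payload.get("unassigned_shards", 0),
        # green: all shards assigned; yellow: primaries assigned, some
        # replicas are not; anything else is treated as needing attention.
        "needs_attention": status not in ("green", "yellow"),
    }


def fetch_health(base_url, index="portals", timeout=10):
    """Perform the GET request, equivalent to `curl <health_url>`."""
    with urllib.request.urlopen(health_url(base_url, index), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Assumed address; adjust to the actual Elasticsearch host and port.
    print(summarize_health(fetch_health("http://localhost:9200")))
```

On a single-node setup a `yellow` status is common, because replica shards have no second node to be assigned to.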
Given the volume of data held in the "portals" index, developers must be aware that maintenance and indexing operations can temporarily require at least 150 GiB of disk space. Storage should therefore be provisioned to accommodate both the existing dataset and the additional space needed by operations that temporarily expand the index.
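Before starting a large maintenance or indexing operation, the available space can be checked with a small script such as the one below. It is a minimal sketch: the 150 GiB threshold comes from the figure above, while the function names and the path passed to `check_data_path` are assumptions, since the location of the Elasticsearch data directory is not given here.

```python
import shutil

GIB = 1024 ** 3  # bytes per gibibyte


def has_headroom(free_bytes, required_gib=150):
    """True if the free space covers the required number of GiB."""
    return free_bytes >= required_gib * GIB


def check_data_path(path, required_gib=150):
    """Return (enough_space, free_gib) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return has_headroom(usage.free, required_gib), usage.free / GIB


if __name__ == "__main__":
    # "." is a placeholder; point this at the Elasticsearch data directory.
    enough, free_gib = check_data_path(".")
    print(f"free: {free_gib:.1f} GiB, enough for maintenance: {enough}")
```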
References
Schindler, U, Diepenbroek, M, 2008. Generic XML-based Framework for Metadata Portals. Computers & Geosciences 34 (12), 1947-1955. https://doi.org/10.1016/j.cageo.2008.02.023
