The GFBio search provides access to a wide range of biological data from various sources, including the European Nucleotide Archive (ENA) and PANGAEA. It offers a user-friendly interface, advanced filtering options, and a semantic search that adds related terms from a specialized ontology, which supports precise data retrieval for researchers and enthusiasts alike. Developers can adapt the search to specific scientific domains through an open API and customizable open source code. Visit https://search.gfbio.org/ to start exploring biological data.

Overview

The GFBio search is based on the Dai:Si search UI. It searches data distributed and published across the GFBio data centers, as well as datasets mobilised through selected data providers from NFDI4Biodiversity Use Cases. The data centers, data providers and the data sources they provide are listed at the GFBio instance of the BioCASe Monitor Service (BMS): https://bms.gfbio.org/ . Beyond this listing, the search also includes data from the European Nucleotide Archive (ENA) and PANGAEA Data Publisher for Earth and Environmental Science. A harvester collects the resources from the BMS and the other data centers and extracts the information into an Elasticsearch index. This index enables a keyword search on the dataset metadata and filtering of the results, e.g. by data center, location, date and time.
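The combination of keyword search and filtering described above can be illustrated with a minimal sketch of an Elasticsearch query body. It is not taken from the GFBio code base: the field names (title, description, dataCenter, collectionDate) and the function name are assumptions chosen for illustration, and the actual index mapping may differ.

```python
import json
from typing import Any, Optional


def build_search_body(
    keywords: str,
    data_center: Optional[str] = None,
    collected_from: Optional[str] = None,
    collected_to: Optional[str] = None,
) -> dict[str, Any]:
    """Build an Elasticsearch query DSL body: keyword match plus optional filters."""
    # Full-text part: match the keywords against metadata fields.
    # The field names are hypothetical, not the real GFBio mapping.
    must = [{"multi_match": {"query": keywords, "fields": ["title", "description"]}}]

    # Filters narrow the result set without influencing relevance scoring.
    filters: list[dict[str, Any]] = []
    if data_center is not None:
        filters.append({"term": {"dataCenter": data_center}})

    date_range: dict[str, str] = {}
    if collected_from is not None:
        date_range["gte"] = collected_from
    if collected_to is not None:
        date_range["lte"] = collected_to
    if date_range:
        filters.append({"range": {"collectionDate": date_range}})

    return {"query": {"bool": {"must": must, "filter": filters}}}


if __name__ == "__main__":
    body = build_search_body(
        "Quercus robur", data_center="PANGAEA", collected_from="2010-01-01"
    )
    print(json.dumps(body, indent=2))
```

A dictionary like this could be sent as the body of an Elasticsearch _search request. Placing the facets in the filter clause rather than in must keeps them from affecting the relevance ranking of the keyword match.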

Fig. 1: The diagram shows the architecture of the search index and harvesting infrastructure. At its core is the Elasticsearch index, which aggregates metadata from various sources. The metadata is collected by specific harvesters, including panFMP and the BMS-Harvester, which gather data from OAI-PMH providers and, through the BioCASe Monitor Service (BMS), from BioCASe providers, respectively. On the user-facing side, the Node Express backend serves the GFBio Search application and provides access to the indexed data via a public API. The diagram shows the general flow of data from harvesting to the search interface and the central role of Elasticsearch in search and retrieval. Note: This is not a technical documentation diagram, and the connections deliberately do not reflect every data flow between the applications.

Status: PRODUCTIVE

Weblink: https://search.gfbio.org/

Target group: data user 

Keywords: search, semantic search

RDC Integration: integrated

Product owner: GFBio e.V.

Getting Started

To start exploring the available biological data, open your web browser and visit https://search.gfbio.org/. The site provides a search interface where you can type your search terms directly into the search bar. For more specific results, use the filters on the left side of the page to narrow down the results by various criteria. The platform also offers a semantic search option. When you use this feature, the system extends your keyword search with information from the ontologies of the GFBio vocabulary service (https://terminologies.gfbio.org/). If your search term matches a term in an ontology, the search automatically includes related terms. For instance, searching for a species by its scientific name will additionally search for its common names. This adds results that are closely related to your original query and gives a more comprehensive search outcome.
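The idea behind the semantic search can be sketched in a few lines. The example below is a simplified illustration, not the GFBio implementation: the small RELATED_TERMS dictionary stands in for a lookup against the vocabulary service, and the function names are hypothetical.

```python
# Toy stand-in for an ontology lookup; the real terminology service
# is not modelled here. Keys are lower-cased scientific names.
RELATED_TERMS: dict[str, list[str]] = {
    "quercus robur": ["pedunculate oak", "English oak"],
}


def expand_query(
    term: str, related: dict[str, list[str]] = RELATED_TERMS
) -> list[str]:
    """Return the original term followed by any related terms, without duplicates."""
    expanded = [term]
    for candidate in related.get(term.strip().lower(), []):
        if candidate.lower() not in (t.lower() for t in expanded):
            expanded.append(candidate)
    return expanded


def to_query_string(terms: list[str]) -> str:
    """Join terms into an OR query, quoting each so multi-word names stay phrases."""
    return " OR ".join(f'"{t}"' for t in terms)


if __name__ == "__main__":
    terms = expand_query("Quercus robur")
    print(to_query_string(terms))
```

A term without a match in the lookup is returned unchanged, so in this sketch the expansion can only add terms to the original query and never replaces it.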

RDC Integration

The search functionality is supported by an API that is currently used and assessed internally. This API enables interaction between different parts of the Research Data Commons (RDC). It is built so that components within the NFDI4Biodiversity framework can access and query the index of biological data. Its main goal is to support the flow of information across different platforms and databases, making it simpler for researchers and other users to find and use the data they need.


Customization

An example of a customized instance of the GFBio search can be found at https://search.dda.gfbio.dev/. This instance shows how the GFBio search platform can be adapted to a specific scientific domain. By setting up their own index, scientific organizations can curate the data available for search, so that users have access to the most relevant and high-quality datasets. Custom branding also allows institutions to integrate the search tool into their existing digital infrastructure, providing a user experience that is consistent with their visual identity and user interface design standards.

User Guide

You can use the search bar to perform a keyword-based search. When you use the semantic search button, your search term is extended with further information if it matches a term in an ontology of the GFBio vocabulary service. This helps you find relevant data more reliably, e.g. by extending scientific names of species with their common names. On the left-hand side of the search UI you will find various filter categories, such as data center, region and date of data collection. These allow you to narrow down your keyword-based search so that you obtain only the results that are most relevant to you. Each search result shows information including a title, description, license and citation. Clicking on the title opens more detailed information (e.g. additional media files such as images).

Developer Guide

The UI component is implemented in Angular and combined with a thin backend layer written in Node.js, which is responsible for data processing. Developers can clone the repository and adapt the source code to their needs. A test instance of the search can be built for local testing using the start script included in the repository. The script builds the UI and the backend and runs the search interface in a dockerized environment, which you can then access in the browser for exploration and development. The source code is published here: https://gitlab-pe.gwdg.de/gfbio/search.gfbio.org.

References

Shafiei, F., Löffler, F., Thiel, S., Opasjumruskit, K., Grabiger, D., Rauh, P., König-Ries, B.: [Dai:Si] - A Modular Dataset Retrieval Framework with a Semantic Search for Biological Data. Proceedings of the Joint Ontology Workshops (JOWO), 2021. https://ceur-ws.org/Vol-2969/paper4-s4biodiv.pdf