Overview

ena2pansimple is an OAI-PMH endpoint implemented in Python with Django. It is a component of the GFBio harvesting, indexing, and search infrastructure. Its primary function is to collect metadata of submissions to the ENA (European Nucleotide Archive) repository and transform it into the Pansimple and Darwin Core formats, providing structured and standardized access to this metadata for the GFBio search index and search UI. Researchers, educators, and other stakeholders can then query the GFBio Search (search.gfbio.org) to find data relevant to their projects.

Fig. 1: The diagram shows the architecture of the search index and harvesting infrastructure. At its core is the Elasticsearch index, which aggregates metadata from various sources. The metadata is collected by dedicated harvesters, including panFMP and BMS-Harvester, which gather data from OAI-PMH providers and from BioCASe providers (through the BioCASe Monitor Service, BMS), respectively. On the user-facing side, the Node Express backend serves the GFBio Search application, providing access to the indexed data via a public API. The diagram highlights the general flow of data from collection to search interface and the central role of Elasticsearch in search and retrieval. Note: This is not a technical documentation diagram, and the connections deliberately do not reflect every data flow between the applications.

Status: PRODUCTIVE

Weblink: https://ena2pansimple.gfbio.org/

Target group: data user

Keywords: indexing, harvesting, search

RDC Integration: integrated

Product owner: GfBio e.V.

User Guide

Users interested in accessing the data can query the OAI-PMH compliant API provided by ena2pansimple. The API endpoint is accessible at https://ena2pansimple.gfbio.org/oai/?verb=Identify. The interface follows the OAI-PMH standard, so data can be retrieved with any OAI-PMH compliant client or harvester.
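As a minimal sketch of what such a query looks like, the following Python snippet builds request URLs for the endpoint using only the standard library. The base URL is the one given above and the six verbs are those defined by OAI-PMH 2.0; the helper names `build_oai_url` and `fetch` are hypothetical and not part of ena2pansimple.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the endpoint address given in this guide.
BASE_URL = "https://ena2pansimple.gfbio.org/oai/"

# The six request verbs defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_url(verb, **params):
    """Return a request URL for the given OAI-PMH verb and arguments.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb, **params}
    return f"{BASE_URL}?{urlencode(query)}"


def fetch(url, timeout=30):
    """Fetch a URL and return the response body as text (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch(build_oai_url("Identify"))` would request the repository description. Which metadata prefixes the endpoint accepts can be discovered with the `ListMetadataFormats` verb rather than assumed.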

For those new to querying this type of API, ena2pansimple provides example queries at https://ena2pansimple.gfbio.org/about/. The examples cover basic requests that demonstrate how to retrieve data and are a practical starting point for exploring the available metadata.

For more comprehensive information on how to use the endpoint, users are encouraged to consult the OAI-PMH documentation itself. The official protocol specification, available at http://www.openarchives.org/OAI/openarchivesprotocol.html, describes the protocol in depth, including its architecture, operations, and the types of requests that can be made. It is the recommended resource for users who want a deeper understanding of how OAI-PMH works and how to use it for data harvesting.
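To illustrate the response side of the protocol, here is a hedged sketch that parses a `ListIdentifiers` response with Python's standard library. The XML namespace is the one defined by OAI-PMH 2.0; the sample response, its identifiers, and its token are hand-written placeholders and not real output from ena2pansimple. Whether the endpoint pages its results with resumption tokens is an assumption here; the protocol allows but does not require it.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written illustrative response; identifiers and token are placeholders.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="ListIdentifiers">https://ena2pansimple.gfbio.org/oai/</request>
  <ListIdentifiers>
    <header>
      <identifier>example:record-1</identifier>
      <datestamp>2024-01-01</datestamp>
    </header>
    <header>
      <identifier>example:record-2</identifier>
      <datestamp>2024-01-01</datestamp>
    </header>
    <resumptionToken>example-token</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
"""


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) from a ListIdentifiers response.

    The token is None when the response carries no further pages.
    Raises RuntimeError if the response contains an OAI-PMH <error> element.
    """
    root = ET.fromstring(xml_text)

    error = root.find("oai:error", OAI_NS)
    if error is not None:
        raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")

    identifiers = [
        element.text
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]

    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    # An empty or missing resumptionToken element means the list is complete.
    token = token_element.text if token_element is not None and token_element.text else None
    return identifiers, token
```

A harvesting loop would pass a returned token back to the endpoint as the `resumptionToken` argument of the next request and stop once the token is `None`.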

Developer Guide

ena2pansimple is developed within a Dockerized Django environment. Encapsulating the environment and dependencies in containers means the tool can be deployed, run, and developed further with little setup effort.

Whether you want to try ena2pansimple out or develop it further, the repository provides the instructions and resources needed to get the tool running locally.

For detailed setup instructions and to access the source code, visit the GitLab repository at https://gitlab-pe.gwdg.de/gfbio/ena2pansimple. Here, you will find all the necessary documentation on how to get ena2pansimple up and running in your local environment. This includes steps for Docker installation, setting up the Django environment, and configuring the application to connect to the ENA repository for data retrieval.

RDC Integration

ena2pansimple plays a pivotal role in the GFBio (German Federation for Biological Data) ecosystem by providing structured access to the metadata of submissions to the European Nucleotide Archive (ENA). As a central component of the GFBio harvesting, indexing, and search infrastructure, it bridges the gap between vast biological data in the ENA repository and potential end-users, facilitating efficient data discovery and utilization.

The tool serves as the primary interface for harvesting metadata from the ENA, transforming it into formats that are compatible with the broader GFBio infrastructure. The harvested metadata is then integrated into the GFBio Elasticsearch index; in the architecture shown in Fig. 1, harvesters such as panFMP collect it from OAI-PMH providers. This process standardizes the metadata and makes it searchable, improving the accessibility and usability of the information contained within the ENA.

The integration of harvested data into the GFBio Elasticsearch index is a critical step that enables the GFBio Search API to provide access to this metadata. This means that all components within the Research Data Commons (RDC) can leverage this consolidated, searchable pool of metadata, facilitating interdisciplinary research and collaboration.

References

https://www.openarchives.org/pmh/
