Overview

A Django-based web application that harvests metadata from the European Nucleotide Archive and serves it through an OAI-PMH endpoint (https://gitlab-pe.gwdg.de/gfbio/ena2pansimple)

ena2pansimple is an OAI-PMH endpoint developed in Python and Django. This software serves as a crucial component within the GFBio harvesting, indexing, and search infrastructure. Its primary function is to collect metadata submissions from the ENA (European Nucleotide Archive) repository, transform this information into Pansimple and Darwin Core formats, and thereby provide structured and standardized access to this information for the GFBio search index and search UI. Researchers, educators, and other stakeholders can then easily query the GFBio Search (search.gfbio.org) to find relevant data for their projects, streamlining the research process and fostering innovation.


LOGO (work in progress)

Status: Productive

Weblink: https://ena2pansimple.gfbio.org/

Target group: data user

Keywords: indexing, harvesting, search

RDC Integration: integrated

Product owner: GfBio e.V.

Getting started

A web-based solution which hooks into the European Nucleotide Archive (ENA) (https://bit.ly/3pPEdv0) metadata API to harvest records. The tool includes a scheduler for regular metadata harvesting and a database to store the metadata. As a post-processing step after harvesting, the tool converts the items into different target formats using XSLT transformations. At present the tool supports the oai-pmh and pansimple formats.

Finally, the service provides access to the collected and transformed resources via an OAI-PMH compliant API, from which OAI-PMH harvester clients can consume the records.


User Guide

Users interested in accessing the data can query the OAI-PMH compliant API provided by ena2pansimple. The API endpoint is accessible at https://ena2pansimple.gfbio.org/oai/?verb=Identify. This interface adheres to the OAI-PMH standard, ensuring a consistent and standardized method for data retrieval.
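As a minimal sketch of what such a query looks like from Python, the snippet below builds OAI-PMH request URLs for the endpoint above and reads two fields from an Identify response. The helper names `build_oai_url` and `parse_identify` are illustrative, and the embedded XML is a made-up sample, not actual output of the service; only the verb names, the `oai_dc` prefix, and the XML namespace come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL taken from this page; the helpers below are hypothetical.
BASE_URL = "https://ena2pansimple.gfbio.org/oai/"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# The six request verbs defined by OAI-PMH 2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_url(verb, base_url=BASE_URL, **params):
    """Return a request URL for the given verb and optional arguments."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Skip arguments that were passed as None.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


def parse_identify(xml_text):
    """Extract repository name and protocol version from an Identify response."""
    root = ET.fromstring(xml_text)
    identify = root.find("oai:Identify", OAI_NS)
    if identify is None:
        raise ValueError("response contains no Identify element")
    return {
        "repositoryName": identify.findtext("oai:repositoryName", namespaces=OAI_NS),
        "protocolVersion": identify.findtext("oai:protocolVersion", namespaces=OAI_NS),
    }


# Made-up sample response, shaped like an OAI-PMH Identify reply.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="Identify">https://ena2pansimple.gfbio.org/oai/</request>
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <baseURL>https://ena2pansimple.gfbio.org/oai/</baseURL>
    <protocolVersion>2.0</protocolVersion>
  </Identify>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_oai_url("Identify"))
    print(build_oai_url("ListRecords", metadataPrefix="oai_dc"))
    print(parse_identify(SAMPLE_IDENTIFY))
```

The protocol requires every repository to support the `oai_dc` metadata prefix. The prefixes under which ena2pansimple exposes its other formats are not stated here, so a `ListMetadataFormats` request is the safest way to discover them.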

For those new to querying this type of API, ena2pansimple offers straightforward examples of how to query the endpoint. These examples can be found at https://ena2pansimple.gfbio.org/about/ and provide a practical starting point for users to familiarize themselves with the process. The examples cover basic queries that demonstrate how to retrieve data, making it easier for users to begin exploring the available metadata.

For more comprehensive information on how to use the endpoint, users are encouraged to consult the documentation of OAI-PMH itself. The official OAI-PMH documentation, available at http://www.openarchives.org/OAI/openarchivesprotocol.html, describes the protocol in depth, including its architecture, operations, and the types of requests that can be made. It is the reference for users who want a deeper understanding of how OAI-PMH works and how to use it for data harvesting.
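One protocol feature that harvesting clients usually need is flow control: list responses may be split into pages linked by a `resumptionToken`. The following is a hedged sketch of a paging loop over `ListIdentifiers`, not the way any particular GFBio client is implemented. The function names and the injectable `fetch` parameter are assumptions made for illustration; the token handling follows the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Base URL taken from this page; function names are hypothetical.
BASE_URL = "https://ena2pansimple.gfbio.org/oai/"
OAI = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_over_http(params, base_url=BASE_URL):
    """Perform one OAI-PMH GET request and return the response body as text."""
    with urlopen(f"{base_url}?{urlencode(params)}", timeout=30) as response:
        return response.read().decode("utf-8")


def iter_identifiers(metadata_prefix="oai_dc", fetch=fetch_over_http):
    """Yield record identifiers, following resumption tokens across pages."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))

        # The protocol reports problems in an <error> element.
        error = root.find(f"{OAI}error")
        if error is not None:
            raise RuntimeError(f"OAI-PMH error {error.get('code')}: {error.text}")

        for header in root.iter(f"{OAI}header"):
            yield header.findtext(f"{OAI}identifier")

        # A missing or empty token marks the last page.
        token = root.findtext(f".//{OAI}resumptionToken")
        if not token:
            return
        # A follow-up request carries only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
```

Passing `fetch` as a parameter keeps the paging logic testable without network access. In real use, `for identifier in iter_identifiers(): ...` would issue HTTP requests against the endpoint. Note that the sketch raises on any error response, including `noRecordsMatch`, which a real client may prefer to treat as an empty result.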

Developer Guide

The ena2pansimple tool is developed within a Dockerized Django environment, offering a streamlined setup for both usage and development. Because the environment and its dependencies are encapsulated in containers, the tool can be deployed, run, and extended with little local configuration.

To begin working with ena2pansimple, whether for trying it out or for further development, the repository provides comprehensive instructions and resources. These guidelines facilitate a smooth setup process, allowing users to get the tool running locally on their machines.

For detailed setup instructions and to access the source code, visit the GitLab repository at https://gitlab-pe.gwdg.de/gfbio/ena2pansimple. Here, you will find all the necessary documentation on how to get ena2pansimple up and running in your local environment. This includes steps for Docker installation, setting up the Django environment, and configuring the application to connect to the ENA repository for data retrieval.

RDC Integration

ena2pansimple plays a pivotal role in the GFBio (German Federation for Biological Data) ecosystem by providing structured access to the metadata of submissions to the European Nucleotide Archive (ENA). As a central component of the GFBio harvesting, indexing, and search infrastructure, it bridges the gap between vast biological data in the ENA repository and potential end-users, facilitating efficient data discovery and utilization.

The tool serves as the primary interface for harvesting metadata from the ENA, transforming it into formats that are compatible with the broader GFBio infrastructure. Once the data is harvested, ena2pansimple integrates it into the GFBio Elasticsearch index. This process not only standardizes the data but also makes it readily searchable, significantly enhancing the accessibility and usability of the information contained within the ENA.

The integration of harvested data into the GFBio Elasticsearch index is a critical step that enables the GFBio Search API to provide access to this metadata. This means that all components within the Research Data Commons (RDC) can leverage this consolidated, searchable pool of metadata, facilitating interdisciplinary research and collaboration.

References