
The ena2pansimple tool enhances the GFBio infrastructure by efficiently converting ENA (European Nucleotide Archive) metadata into formats that are easily searchable within the GFBio search platform. It serves as a vital bridge for researchers, allowing streamlined access to biological data through an OAI-PMH compliant API. By integrating ENA metadata into the GFBio Elasticsearch index, it significantly boosts data discovery and supports interdisciplinary research. The GitLab repository provides detailed instructions for setup and use, making ena2pansimple essential for enhancing accessibility and usability of biological data in research communities.

Overview

ena2pansimple is an OAI-PMH endpoint developed in Python and Django. This software serves as a crucial component within the GFBio harvesting, indexing, and search infrastructure. Its primary function is to collect metadata from the ENA (European Nucleotide Archive) repository, transform this information into PanSimple and Darwin Core formats, and thereby provide structured and standardized access to this information for the GFBio search index and search UI. Researchers, educators, and other stakeholders can then easily query the GFBio Search (https://search.gfbio.org) to find relevant data for their projects, streamlining the research process and fostering innovation.


Fig. 1: The diagram shows the architecture of the search index and harvesting infrastructure. At its core is the Elasticsearch index, which aggregates metadata from various sources. The metadata is collected by dedicated harvesters: panFMP gathers data from OAI-PMH providers, and BMS-Harvester gathers data from BioCASe providers through the BioCASe Monitor Service (BMS). On the user-facing side, the Node Express backend serves the GFBio Search application, providing access to the indexed data via a public API. The diagram illustrates the general flow of data from harvesting to the search interface and the central role of Elasticsearch in search and retrieval. Note: This is not a technical documentation diagram, and the connections deliberately do not reflect every data flow between the applications.

A Django web application that harvests metadata from the European Nucleotide Archive and exposes it through an OAI-PMH endpoint (https://gitlab-pe.gwdg.de/gfbio/ena2pansimple).


Status: Productive

Weblink: https://ena2pansimple.gfbio.org/

Target group: data user

Keywords: indexing, harvesting, search

RDC Integration: integrated

Product owner: GFBio e.V.

Getting started

ena2pansimple is a web-based tool that hooks into the European Nucleotide Archive (ENA) metadata API (https://bit.ly/3pPEdv0) to harvest records. It includes a scheduler for regular metadata harvesting and a database to store the harvested metadata. As a post-processing step, the tool converts the records into different target formats using XSLT transformations. Currently the tool supports the formats oai-pmh and pansimple. Finally, the service provides access to the collected and transformed records via an OAI-PMH compliant API that OAI-PMH harvester clients can consume.

User Guide

Users interested in accessing the data can query the OAI-PMH compliant API provided by ena2pansimple. The API endpoint is accessible at https://ena2pansimple.gfbio.org/oai/?verb=Identify. This interface adheres to the OAI-PMH standard, ensuring a consistent and standardized method for data retrieval. For those new to querying this type of API, ena2pansimple offers straightforward examples of how to query the endpoint. These examples can be found at https://ena2pansimple.gfbio.org/about/ and provide a practical starting point for users to familiarize themselves with the process. The examples cover basic queries that demonstrate how to retrieve data, making it easier for users to begin exploring the available metadata.
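As an illustration of how such queries are formed, the following minimal Python sketch builds request URLs for the endpoint named above. The helper `build_oai_url` is hypothetical and not part of ena2pansimple. The verb names come from the OAI-PMH 2.0 specification, while the `metadataPrefix` value `pansimple` is an assumption based on the formats mentioned in this page; a `ListMetadataFormats` request should confirm which prefixes the endpoint actually offers.

```python
from urllib.parse import urlencode

# Endpoint base URL as given in this page.
BASE_URL = "https://ena2pansimple.gfbio.org/oai/"

# The six request verbs defined by the OAI-PMH 2.0 specification.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def build_oai_url(verb, base_url=BASE_URL, **params):
    """Return an OAI-PMH request URL (hypothetical helper for illustration)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb!r}")
    query = {"verb": verb}
    # Skip arguments that were left unset.
    query.update({key: value for key, value in params.items() if value is not None})
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    # Describe the repository.
    print(build_oai_url("Identify"))
    # Discover which metadata formats the endpoint offers.
    print(build_oai_url("ListMetadataFormats"))
    # Assumption: "pansimple" is an accepted metadataPrefix.
    print(build_oai_url("ListRecords", metadataPrefix="pansimple"))
```

The resulting URLs can be opened in a browser or passed to any HTTP client; the response is an XML document as defined by the protocol.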

For more comprehensive information on how to use the endpoint, users are encouraged to consult the documentation of OAI-PMH itself. The official OAI-PMH documentation, available at http://www.openarchives.org/OAI/openarchivesprotocol.html, offers an in-depth look at the protocol, including its architecture, operations, and the types of requests that can be made. This resource is useful for users who wish to gain a deeper understanding of how OAI-PMH works and how to use its capabilities for data harvesting.
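One protocol detail worth knowing when harvesting is that list requests may be answered in several pages, each carrying a `resumptionToken` to request the next page. The sketch below, using only the Python standard library, shows one way to read the identifiers and the token from a single response. The XML in `SAMPLE_RESPONSE` is a hand-written example that follows the OAI-PMH 2.0 response layout; it is not actual output from ena2pansimple, and `parse_page` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# XML namespace used by OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample response (an assumption for illustration, not real data).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="ListIdentifiers">https://example.org/oai/</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:record-1</identifier>
      <datestamp>2024-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:example.org:record-2</identifier>
      <datestamp>2024-01-02</datestamp>
    </header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>
""".strip()


def parse_page(xml_text):
    """Return (identifiers, resumption_token) for one list response page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_element = root.find(".//oai:resumptionToken", OAI_NS)
    # A missing or empty resumptionToken means there are no further pages.
    if token_element is not None and token_element.text:
        token = token_element.text.strip()
    else:
        token = None
    return identifiers, token


if __name__ == "__main__":
    ids, next_token = parse_page(SAMPLE_RESPONSE)
    print(ids)
    print(next_token)
```

A harvesting client would repeat the request with the returned token as the `resumptionToken` argument until `parse_page` returns `None` for the token.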

Developer Guide

The ena2pansimple tool is developed within a Dockerized Django environment, offering a streamlined setup for both usage and development. This approach ensures that the tool can be easily deployed, run, and developed upon by encapsulating its environment and dependencies, making it highly accessible for developers and users alike.

To begin working with ena2pansimple, whether for trying it out or for further development, the repository provides comprehensive instructions and resources. These guidelines facilitate a smooth setup process, allowing users to get the tool running locally on their machines.

For detailed setup instructions and to access the source code, visit the GitLab repository at https://gitlab-pe.gwdg.de/gfbio/ena2pansimple. There you will find the documentation needed to get ena2pansimple up and running in your local environment. This includes steps for Docker installation, setting up the Django environment, and configuring the application to connect to the ENA repository for data retrieval.

RDC Integration

ena2pansimple plays a pivotal role in the GFBio (German Federation for Biological Data) ecosystem by providing structured access to the metadata of submissions to the European Nucleotide Archive (ENA). As a central component of the GFBio harvesting, indexing, and search infrastructure, it bridges the gap between vast biological data in the ENA repository and potential end-users, facilitating efficient data discovery and utilization.

The tool serves as the primary interface for harvesting metadata from the ENA, transforming it into formats that are compatible with the broader GFBio infrastructure. Once the data is harvested, ena2pansimple integrates it into the GFBio Elasticsearch index. This process not only standardizes the data but also makes it readily searchable, significantly enhancing the accessibility and usability of the information contained within the ENA.

The integration of harvested data into the GFBio Elasticsearch index is a critical step that enables the GFBio Search API to provide access to this metadata. This means that all components within the Research Data Commons (RDC) can leverage this consolidated, searchable pool of metadata, facilitating interdisciplinary research and collaboration.

References

https://www.openarchives.org/pmh/
