Overview

Status of Service

The GFBio harvesting infrastructure primarily uses the OAI-PMH protocol to collect data daily from several providers. Other protocols, such as Access to Biological Collection Data (ABCD), are implemented for special cases. The infrastructure adopts "pansimple", an extended version of Dublin Core developed by PANGAEA, as its metadata format. pansimple is tailored for search and data discovery and does not include usage metadata: it carries basic metadata, with structured extensions for geographical coordinates and typed enumerations for faceting.
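As an illustration of the OAI-PMH harvesting loop described above, the following minimal sketch builds a ListRecords request URL and extracts record identifiers and the resumption token from one response page. It is not the GFBio harvester itself: the function names, the base URL in the usage note, and the default metadata prefix `oai_dc` are assumptions made for the example. Only the protocol elements (the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments, and the OAI-PMH 2.0 namespace) come from the OAI-PMH specification.

```python
from typing import List, Optional, Tuple
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request URL (hypothetical helper)."""
    if resumption_token:
        # resumptionToken is an exclusive argument: when it is present,
        # no other argument except the verb may be sent.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text: str) -> Tuple[List[str], Optional[str]]:
    """Return the record identifiers on one page and the next token, if any."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    # An empty or missing resumptionToken marks the last page.
    return ids, token
```

A harvester would fetch `list_records_url("https://example.org/oai")` (a placeholder endpoint), pass the response body to `parse_page`, and repeat with the returned token until it is `None`.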

Getting started

The basic installation involves setting up Elasticsearch with NGINX as a proxy on a server with sufficient hardware resources (approx. 14 CPUs, 32 GB RAM). Detailed installation instructions can be found in the GitLab repository: https://gitlab.gwdg.de/gfbio/harvesting.

User Guide

To maintain the Elasticsearch index, tools such as iotop (for monitoring disk I/O) and curl (for calling the Elasticsearch REST API) are recommended. The index, named "portals", contains all harvested documents, with no differentiation between test indexes. Maintenance and indexing operations may temporarily require up to 150 GiB of disk space.
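Before starting a maintenance or reindexing run, it can help to check the disk headroom against the 150 GiB figure above. The sketch below is one possible way to do that, not part of the GFBio tooling. The helper names and the data path are hypothetical. The size parser assumes the JSON returned by the Elasticsearch `_cat/indices` API when called with `format=json&bytes=b`, for example via `curl 'http://localhost:9200/_cat/indices/portals?format=json&bytes=b'`, where the host and port are also assumptions.

```python
import json
import shutil

GIB = 1024 ** 3  # bytes per GiB


def free_bytes(path: str) -> int:
    """Free bytes on the filesystem holding `path` (e.g. the Elasticsearch data directory)."""
    return shutil.disk_usage(path).free


def has_headroom(free: int, required_gib: float = 150.0) -> bool:
    """True if `free` bytes cover the temporary space needed for maintenance."""
    return free >= required_gib * GIB


def index_size_bytes(cat_indices_json: str, index: str = "portals") -> int:
    """Read the store size of `index` from a `_cat/indices` JSON response.

    Assumes the request used `bytes=b`, so `store.size` is a plain byte count.
    """
    for row in json.loads(cat_indices_json):
        if row["index"] == index:
            return int(row["store.size"])
    raise KeyError(f"index {index!r} not found in response")
```

For instance, `has_headroom(free_bytes("/var/lib/elasticsearch"))` reports whether the volume can absorb the temporary space. The path shown is a common default and may differ on the actual server.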

RDC-Integration

The Elasticsearch index is accessible via the GFBio search API, providing a seamless integration with the Research Data Commons. This setup allows for the efficient search and retrieval of metadata, supporting the broader goals of data accessibility and interoperability within the RDC.
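To make the search-and-facet pattern concrete, the sketch below builds a request body in the standard Elasticsearch query DSL: a `simple_query_string` full-text query combined with a `terms` aggregation for faceting. Whether the GFBio search API accepts this exact body is an assumption, and the facet field name `dataCenterFacet` and the helper name are hypothetical placeholders. The actual field names should be taken from the index mapping.

```python
import json
from typing import Any, Dict


def build_search_body(text: str,
                      facet_field: str = "dataCenterFacet",  # hypothetical field name
                      size: int = 10) -> Dict[str, Any]:
    """Build an Elasticsearch query DSL body with one facet aggregation."""
    return {
        "size": size,
        # Full-text search; all terms must match.
        "query": {
            "simple_query_string": {"query": text, "default_operator": "and"}
        },
        # Terms aggregation: document counts per facet value.
        "aggs": {
            "facet": {"terms": {"field": facet_field, "size": 20}}
        },
    }


if __name__ == "__main__":
    # Print the JSON that would be sent as the request body.
    print(json.dumps(build_search_body("quercus"), indent=2))
```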

Developer Guide

Developers looking to work with Elasticsearch and NGINX within the GFBio infrastructure should consult the configuration files and plugin JARs available in the GitLab repository. For advanced configurations, such as setting up additional data nodes or securing SSH access, developers should follow the guidelines provided in the documentation and consider network security implications.

References

