Training datasets
Training datasets are essential for teaching and training young researchers. They form the basis for teaching data skills and analysis methods. Here, we are referring to datasets used in tutorials on research data management, as demo datasets for tools or methods, or as examples of challenges in data handling. This definition does not cover datasets used to train AI applications.
To be labelled as a training dataset, a dataset has to:
- be FAIR (Findable, Accessible, Interoperable, Reusable).
- be freely available, with an appropriate license and open data format.
- be of reasonable size.
- be citable.
- enable easy-to-understand but interesting questions to be addressed.
- be sufficiently documented.
- be either “perfect” or contain didactic errors.
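Some of the criteria above can be checked mechanically, which may help when screening candidate datasets. The following is a minimal sketch, not an official NFDI4Biodiversity tool: the `DatasetRecord` fields, the list of open formats, and the 100 MB size limit are all assumptions made for illustration. Criteria such as FAIRness or whether a dataset supports interesting questions still need human judgement and are not covered here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical metadata record; field names are illustrative only."""
    doi: Optional[str]       # persistent identifier, needed for citability
    license: Optional[str]   # e.g. "CC-BY-4.0"
    file_format: str         # file extension without the dot
    size_mb: float
    has_documentation: bool
    error_status: str        # "clean" or "didactic_errors"


# Assumed values for this sketch, not taken from the criteria themselves.
OPEN_FORMATS = {"csv", "tsv", "json", "txt"}
MAX_SIZE_MB = 100.0


def check_training_dataset(record: DatasetRecord) -> List[str]:
    """Return a list of unmet criteria; an empty list means none were found."""
    problems = []
    if not record.doi:
        problems.append("not citable: no persistent identifier")
    if not record.license:
        problems.append("no license given")
    if record.file_format.lower() not in OPEN_FORMATS:
        problems.append("format is not in the list of open formats")
    if record.size_mb > MAX_SIZE_MB:
        problems.append("dataset exceeds the assumed size limit")
    if not record.has_documentation:
        problems.append("documentation is missing")
    if record.error_status not in {"clean", "didactic_errors"}:
        problems.append("unclear whether the data are clean or contain didactic errors")
    return problems


# Placeholder values for demonstration; the DOI below is not a real identifier.
example = DatasetRecord(
    doi="10.1234/placeholder",
    license="CC-BY-4.0",
    file_format="csv",
    size_mb=2.5,
    has_documentation=True,
    error_status="didactic_errors",
)
print(check_training_dataset(example))
```

Returning a list of problems rather than a single yes/no value makes it easier to see which criterion a candidate dataset still fails.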
For an overview, see the poster on what training datasets are in the context of NFDI4Biodiversity (in German): Signer, J., Schlägel, U., Tschink, D., & Röder, J. (2024). Trainingsdatensätze. Zenodo. https://doi.org/10.5281/zenodo.13805722
Training datasets help to illustrate all stages of the data life cycle (DLC), for example:
- Metadata standards to describe and structure (newly collected) data
- (Reproducible) processing of data
- (Reproducible) data analysis
- Workflows to archive, share and publish data for personal and/or public re-use