Harvesters

The harvesters of the HMC Toolbox for Data Mining collect metadata of literature publications from multiple sources. Although multiple technologies and APIs are available, the harvesting process is designed to harvest OAI-PMH endpoints in a generic way. In practice, these endpoints are offered by the libraries of the research centers to be integrated.

SickleHarvester

This module uses the lightweight Python library Sickle to download data from an OAI-PMH endpoint into an output directory.

It checks which metadata schemas the endpoint offers and can harvest all of them into separate sub-directories. The OAI-PMH specification requires every endpoint to support at least oai_dc as a metadata schema.
Hence, the SickleHarvester requires at least this schema.
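
The schema check described above might look roughly like the following sketch. It is not the SickleHarvester's actual code: the function names (available_prefixes, prepare_output_dirs, connect) are hypothetical, as is the decision to raise an error when oai_dc is missing. The only Sickle calls used are the Sickle constructor and ListMetadataFormats(), whose results expose a metadataPrefix attribute.

```python
from pathlib import Path

REQUIRED_PREFIX = "oai_dc"  # the schema every OAI-PMH endpoint must support


def available_prefixes(client):
    """Return the metadataPrefix of every format the endpoint advertises."""
    return [fmt.metadataPrefix for fmt in client.ListMetadataFormats()]


def prepare_output_dirs(client, out_dir):
    """Create one sub-directory per advertised schema, keyed by prefix.

    Assumption: a missing oai_dc schema is treated as a hard error.
    """
    prefixes = available_prefixes(client)
    if REQUIRED_PREFIX not in prefixes:
        raise ValueError(f"endpoint does not offer {REQUIRED_PREFIX!r}")

    dirs = {}
    for prefix in prefixes:
        target = Path(out_dir) / prefix
        target.mkdir(parents=True, exist_ok=True)
        dirs[prefix] = target
    return dirs


def connect(endpoint_url):
    """Build a Sickle client (imported lazily so the helpers work offline)."""
    from sickle import Sickle

    return Sickle(endpoint_url)
```

A call such as prepare_output_dirs(connect(endpoint_url), "harvest") would then yield one directory per schema, for example harvest/oai_dc, where endpoint_url stands for the OAI-PMH base URL of a center's library.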

Harvested files are initially stored as XML. Their filenames follow the format oai_record_identifier.schema.xml. In the next stage, the Metadata Extractor module parses these XML files for metadata.
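
As an illustration of the naming scheme, the sketch below builds a filename from a record identifier and a schema and writes the raw XML into the schema's sub-directory. The helper names (record_filename, save_record) are hypothetical. Replacing characters such as ':' and '/' with '_' is an assumption, since the page does not say how identifiers are made filesystem-safe.

```python
import re
from pathlib import Path


def record_filename(identifier, schema):
    """Build '<oai_record_identifier>.<schema>.xml'.

    Assumption: any character outside letters, digits, '.', '_' and '-'
    is replaced by '_' so the identifier is safe to use as a filename.
    """
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", identifier)
    return f"{safe_id}.{schema}.xml"


def save_record(raw_xml, identifier, schema, out_dir):
    """Write one record's XML into '<out_dir>/<schema>/' and return the path."""
    target_dir = Path(out_dir) / schema
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / record_filename(identifier, schema)
    path.write_text(raw_xml, encoding="utf-8")
    return path
```

When iterating over the records returned by Sickle's ListRecords(metadataPrefix=...), the identifier would come from record.header.identifier and the XML from record.raw.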

Special Remarks

Although the SickleHarvester is capable of harvesting all OAI-PMH metadata schemas, the current implementation of the HMC Toolbox for Data Mining makes use of MarcXML and DublinCore only.

Disclaimer

Please note that the list of data publications obtained from data harvesting using the HMC Toolbox for Data Mining, as presented in the HMC FAIR Data Dashboard, is neither complete nor entirely free of falsely identified data. If you wish to reuse the data shown in the dashboard for sensitive topics such as funding mechanisms, we highly recommend a manual review of the data.

Last updated on May 26, 2026