Harvesters

The harvesters of the HMC Toolbox for Data Mining collect metadata of literature publications from multiple sources. While multiple technologies and APIs are available, the harvesting process is designed to harvest OAI-PMH endpoints in a generic way. In practice, these endpoints are offered by the libraries of the research centers to be integrated.

SickleHarvester

This module uses the lightweight Python library Sickle to download data from an OAI-PMH endpoint into an output directory.
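As a rough illustration of how a download with Sickle can look, the sketch below iterates over all records of one schema. It is not the SickleHarvester's actual code: the helper names `make_client` and `iter_raw_records` and the commented-out endpoint URL are assumptions made for this example. Only `Sickle(...)`, `ListRecords(...)`, `record.header.identifier`, and `record.raw` come from the Sickle library itself.

```python
from typing import Any, Iterator, Tuple


def make_client(endpoint_url: str) -> Any:
    """Create a Sickle client for an OAI-PMH endpoint (needs `pip install sickle`)."""
    # Imported lazily so that the helper below can be used without Sickle installed.
    from sickle import Sickle

    return Sickle(endpoint_url)


def iter_raw_records(client: Any, metadata_prefix: str = "oai_dc") -> Iterator[Tuple[str, str]]:
    """Yield (OAI identifier, raw XML) for every non-deleted record of one schema."""
    # Sickle follows OAI-PMH resumption tokens internally, so this loop
    # walks the complete result set rather than only the first page.
    records = client.ListRecords(metadataPrefix=metadata_prefix, ignore_deleted=True)
    for record in records:
        yield record.header.identifier, record.raw


# Hypothetical usage (placeholder URL, not a real endpoint):
# client = make_client("https://library.example.org/oai2d")
# for identifier, xml in iter_raw_records(client, "marcxml"):
#     print(identifier, len(xml))
```

Keeping the client creation separate from the iteration means the record loop can be exercised with a stand-in client, without network access.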

It checks which metadata schemas are available and can harvest all of them into separate sub-directories. The OAI-PMH specification requires every endpoint to support at least oai_dc as a metadata schema.
Hence, the SickleHarvester requires at least this schema.
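The schema check described above can be sketched as follows. This is an assumption about how such a check might be written, not the toolbox's implementation: `available_prefixes`, `select_prefixes`, and `REQUIRED_PREFIX` are hypothetical names. `ListMetadataFormats()` and the `metadataPrefix` attribute are part of Sickle.

```python
from typing import Any, List, Optional

# oai_dc is the one schema every OAI-PMH endpoint has to offer.
REQUIRED_PREFIX = "oai_dc"


def available_prefixes(client: Any) -> List[str]:
    """Return the metadataPrefix of every schema the endpoint advertises."""
    return [fmt.metadataPrefix for fmt in client.ListMetadataFormats()]


def select_prefixes(advertised: List[str], wanted: Optional[List[str]] = None) -> List[str]:
    """Pick the schemas to harvest; fail early if oai_dc is missing."""
    if REQUIRED_PREFIX not in advertised:
        raise ValueError(f"endpoint does not offer the mandatory {REQUIRED_PREFIX!r} schema")
    if wanted is None:
        # No restriction given: harvest everything the endpoint offers.
        return list(advertised)
    # Keep the endpoint's order and drop schemas it does not provide.
    return [prefix for prefix in advertised if prefix in wanted]
```

Each selected prefix would then be harvested into its own sub-directory.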

Harvested files are initially stored as XML. Their filenames follow the format oai_record_identifier.schema.xml. In the next stage, the Metadata Extractor module parses these XML files for metadata.
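The naming scheme can be illustrated with a small sketch. OAI identifiers usually contain characters such as ":" or "/" that are awkward in filenames. How the toolbox handles them is not stated here, so replacing them with "_" is purely an assumption of this example, as are the helper names `record_filename` and `write_record`.

```python
import re
from pathlib import Path


def record_filename(identifier: str, schema: str) -> str:
    """Build '<identifier>.<schema>.xml' from an OAI record identifier."""
    # Assumption: anything other than letters, digits, '.', '_' and '-'
    # is replaced by '_' to keep the name filesystem-safe.
    safe_id = re.sub(r"[^A-Za-z0-9._-]", "_", identifier)
    return f"{safe_id}.{schema}.xml"


def write_record(output_dir: str, identifier: str, schema: str, raw_xml: str) -> Path:
    """Store one record's XML in the sub-directory of its schema."""
    target_dir = Path(output_dir) / schema
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / record_filename(identifier, schema)
    target.write_text(raw_xml, encoding="utf-8")
    return target
```

With these assumptions, a record `oai:example.org:123` harvested as `oai_dc` would end up at `<output_dir>/oai_dc/oai_example.org_123.oai_dc.xml`.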

Special Remarks

Although the SickleHarvester is capable of harvesting all OAI-PMH metadata schemas, the current implementation of the HMC Toolbox for Data Mining makes use of MarcXML and DublinCore only.

Disclaimer

Please note that the list of data publications obtained from data harvesting using the HMC Toolbox for Data Mining, as presented in the HMC FAIR Data Dashboard, is neither complete nor entirely free of falsely identified data. If you wish to reuse the data shown in the dashboard for sensitive topics such as funding mechanisms, we highly recommend a manual review of the data.
