CORE harvests many different data providers across the world. Differences in local laws, regulations, recommendations and research practices make harvesting content difficult. Many standards exist to normalise how data is accessed, but they are often interpreted in different ways. This document describes how we recommend repositories are configured.
CORE aims to support local guidelines and practices, particularly RIOXX and OpenAIRE. If any of our recommendations conflict with your own local policies, please contact us and we will try to adapt our policies to meet local needs.
For UK institutions participating in the Research Excellence Framework 2021, we recommend reading through our REF Support Guidelines. If you find any ambiguity between this document and our General Harvesting Guidelines, the REF Support Guidelines take priority.
How to understand this document (Recommended)
The guideline which we recommend you follow
Each guideline follows the same pattern: a title, the status, the guideline itself and an explanation of its meaning.
A guideline status can be one of the following:
Required: If this guideline is not followed, we will be unable to harvest your repository.
Recommended: Following this recommendation will yield the best results when CORE harvests your content.
Supported: We will still be able to harvest your repository, but the result may not be optimal: we may miss content, or harvesting may take longer.
If you have any questions, please contact us and we will be happy to assist.
Supported Protocols (Required)
A data provider must support OAI-PMH or ResourceSync
For CORE to harvest a data provider, the provider must expose an OAI-PMH or ResourceSync compliant endpoint. Most common repository software, such as EPrints, DSpace or Open Journal Systems (OJS), supports OAI-PMH.
For OAI-PMH, we support the following metadataPrefix:
If you create a customised metadataPrefix which follows our guidelines, we recommend calling it dc_core.
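As a rough illustration of what a harvesting request looks like, the sketch below builds an OAI-PMH ListRecords URL for a given metadataPrefix. The base URL is a made-up placeholder and `build_list_records_url` is a hypothetical helper, not part of CORE; only the `verb` and `metadataPrefix` request arguments come from the OAI-PMH protocol.

```python
from urllib.parse import urlencode


def build_list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Return an OAI-PMH ListRecords request URL for the given prefix."""
    # 'verb' and 'metadataPrefix' are standard OAI-PMH request arguments.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical endpoint, used only for illustration.
example_url = build_list_records_url("https://repository.example.org/oai", "dc_core")
print(example_url)
```

Opening a URL of this shape in a browser is a quick way to check that a customised metadataPrefix is actually exposed by the endpoint.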
Direct link to the fulltext (Recommended)
<dc:identifier> contains an unambiguous URL to the full text
CORE supports any number of dc:identifier fields and expects one of them to contain a direct link to the full text. Any limit on the number of fields comes from the application profile in use. For example, RIOXX allows only one dc:identifier field per item.
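To make the idea of a direct link more concrete, here is a minimal sketch that reads the dc:identifier values of a record and keeps those that look like direct file links. The sample record and its URLs are invented, `direct_fulltext_links` is a hypothetical helper, and testing the file extension is only a simple heuristic assumed for this example; it is not a description of how CORE detects full texts.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

DC_NS = "http://purl.org/dc/elements/1.1/"
FULLTEXT_EXTENSIONS = (".pdf", ".doc", ".docx")

# Hypothetical record fragment; the URLs are placeholders.
SAMPLE_RECORD = """
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:identifier>https://repository.example.org/record/123</dc:identifier>
  <dc:identifier>https://repository.example.org/files/123/paper.pdf</dc:identifier>
</metadata>
"""


def direct_fulltext_links(record_xml: str) -> List[str]:
    """Return dc:identifier values whose URL path ends in a full-text extension."""
    root = ET.fromstring(record_xml.strip())
    links = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        # Heuristic (assumption): a direct link points at a file, not a landing page.
        if urlparse(value).path.lower().endswith(FULLTEXT_EXTENSIONS):
            links.append(value)
    return links


print(direct_fulltext_links(SAMPLE_RECORD))
```

In this made-up record the first identifier is a landing page and is skipped, while the second one points straight at the file.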
EPrints follows this recommendation by default. It usually looks like this:
Indirect link to the fulltext (Supported)
<dc:identifier> contains a URL to a metadata page which provides a “download pdf” URL
Some repository software (e.g. DSpace) inserts a metadata page into dc:identifier:
We are not always able to identify the fulltext using this setup.
If we are able to find a fulltext, we run a process called “Title Matching” to ensure we have found the correct item. This compares the title in the metadata with the title in the fulltext. The process is not perfect, causes extra load on the server and requires more processing time per document. Scanned documents are impossible to match when OCR has not been performed, because they contain no text to compare against.
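The following sketch illustrates the general idea of comparing a metadata title with the opening text of a document. It is not CORE's actual “Title Matching” implementation: the normalisation, the use of `difflib` and the 0.9 threshold are all assumptions chosen for this example.

```python
import re
from difflib import SequenceMatcher


def normalise(text: str) -> str:
    """Lowercase and collapse runs of non-alphanumeric characters to one space.

    ASCII-only simplification: accented letters are treated as separators.
    """
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_matches(metadata_title: str, fulltext_start: str, threshold: float = 0.9) -> bool:
    """Return True if the title appears, exactly or approximately, in the text."""
    title = normalise(metadata_title)
    body = normalise(fulltext_start)
    if not title:
        return False
    if title in body:
        return True
    # Fuzzy fallback: compare the title with an equally long prefix of the text.
    candidate = body[: len(title)]
    return SequenceMatcher(None, title, candidate).ratio() >= threshold


print(title_matches("Open Access: A Survey", "OPEN ACCESS - A SURVEY\nJane Doe"))
print(title_matches("Open Access: A Survey", "Annual report 2019 of the library"))
```

A scanned document without OCR yields an empty `fulltext_start`, so a check of this kind has nothing to compare and cannot succeed.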
Following our “Direct link to the fulltext” recommendation avoids this situation. We are even able to harvest scanned full texts.
Fulltext format (Required)
The fulltext is uploaded as a PDF, DOC or DOCX file
CORE only supports valid PDF, DOC or DOCX files that contain extractable text.
If a PDF is produced from a scan, ensure that OCR has been applied so that the file contains a text layer.
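A repository manager who wants a quick sanity check on uploaded files could inspect the first bytes of each file, as sketched below. `sniff_container` is a hypothetical helper. The byte signatures are the standard ones for PDF, the OLE2 container used by legacy .doc files and the ZIP container used by .docx files, but the check says nothing about whether the file contains extractable text, and OLE2 and ZIP are also used by other formats.

```python
PDF_MAGIC = b"%PDF-"
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")  # container used by legacy .doc
ZIP_MAGIC = b"PK\x03\x04"  # container used by .docx


def sniff_container(head: bytes) -> str:
    """Classify a file by its leading bytes: 'pdf', 'ole2', 'zip' or 'unknown'."""
    if head.startswith(PDF_MAGIC):
        return "pdf"
    if head.startswith(OLE2_MAGIC):
        return "ole2"  # possibly a .doc file
    if head.startswith(ZIP_MAGIC):
        return "zip"  # possibly a .docx file
    return "unknown"


print(sniff_container(b"%PDF-1.7\n"))
print(sniff_container(b"<html><body>Login required</body></html>"))
```

A result of "unknown" for a URL that is supposed to serve a PDF often means the server returned an HTML page (for example a login or landing page) instead of the file.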
Fulltext links are hosted on the same domain (Required)
The fulltext is hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains.
To ensure we harvest the correct content, fulltexts must be hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains.
Exceptions are made where the domain is owned by the same organisation.
We are unable to harvest fulltexts hosted on shared services such as Dropbox, Google Drive or OneDrive.
The fictional organisation “We Love Open Access” has an OAI-PMH endpoint on
By default, we will accept any papers hosted like this:
Notice how the files and the OAI-PMH endpoint are hosted on the same domain (weloveopenaccess.org).
We can accept the following setups, but please contact us so that we can allow a specific exception:
- https://files.weloveopenaccess.org/files/paper.pdf (with exception)
- https://weloveopenaccess-files.org/files/paper.pdf (with exception)
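The relationship between the endpoint host and the file host in these examples can be checked with a few lines of code. This is a sketch only: `host_relation` is a hypothetical helper, the endpoint path `/oai` is an assumption made for the example, and the result only describes how the two hosts relate; whether a given setup is accepted or needs an exception is decided by the rules in this section.

```python
from urllib.parse import urlparse


def host_relation(endpoint_url: str, fulltext_url: str) -> str:
    """Return 'same', 'subdomain' or 'different' for the full-text host."""
    endpoint_host = (urlparse(endpoint_url).hostname or "").lower()
    fulltext_host = (urlparse(fulltext_url).hostname or "").lower()
    if not endpoint_host or not fulltext_host:
        return "different"
    if fulltext_host == endpoint_host:
        return "same"
    # A subdomain ends with a dot followed by the endpoint host.
    if fulltext_host.endswith("." + endpoint_host):
        return "subdomain"
    return "different"


# Hypothetical endpoint URL on the fictional domain used in this section.
ENDPOINT = "https://weloveopenaccess.org/oai"

print(host_relation(ENDPOINT, "https://weloveopenaccess.org/files/paper.pdf"))
print(host_relation(ENDPOINT, "https://files.weloveopenaccess.org/files/paper.pdf"))
print(host_relation(ENDPOINT, "https://weloveopenaccess-files.org/files/paper.pdf"))
```

The comparison is deliberately naive. If the endpoint itself lives on a subdomain, comparing registered domains would need a public suffix list, which this sketch does not attempt.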
We may also accept fulltexts where there is a clear link between the owner of the OAI-PMH endpoint and the hosting website:
However, we are not able to support the following hosting providers: