Introduction
CORE harvests many data providers across the world. Differences in local laws, regulations, recommendations and research practices make harvesting content difficult. Many standards exist to normalise how data is accessed, but they are often interpreted in different ways. This document describes how we recommend repositories are configured.
CORE aims to support local guidelines and practices, particularly RIOXX and OpenAIRE. If any of our recommendations conflict with your own local policies, please contact us and we will try to adapt our policies to meet local needs.
UK Only
For UK institutions participating in the Research Excellence Framework 2021, we recommend reading through our REF Support Guidelines. If you find any ambiguity between this document and our General Harvesting Guidelines, the REF Support Guidelines take priority.
How to understand this document (Recommended)
The guideline which we recommend you follow
Explanation:
Each guideline follows the same pattern: a title, the status, the guideline itself and an explanation of its meaning.
A guideline status can be one of the following:
Required
If this guideline is not followed, we will be unable to harvest your repository
Recommended
Following this recommendation will yield the best results for CORE to harvest your content
Supported
We will still be able to harvest your repository, but the result may not be optimal: we may miss content or take longer to harvest
If you have any questions, please contact us and we will be happy to assist.
Metadata
Supported Protocols (Required)
A data provider must support OAI-PMH or ResourceSync
Explanation:
In order for CORE to harvest a data provider, the provider must expose an OAI-PMH or ResourceSync compliant endpoint. Most common repository software, such as EPrints, DSpace or Open Journal Systems (OJS), supports OAI-PMH.
For OAI-PMH, we support the following metadataPrefix values:
oai_dc
rioxx
qdc
If you create a customised metadataPrefix which follows our guidelines, we recommend calling it dc_core.
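As an illustration of how a metadataPrefix is selected, the following minimal sketch builds an OAI-PMH ListRecords request URL. The endpoint https://repository.example.org/oai and the function name list_records_url are hypothetical and used only for illustration; this is not a description of how CORE issues its requests.

```python
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for the given metadataPrefix."""
    # "verb" and "metadataPrefix" are standard OAI-PMH request arguments.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


# Hypothetical endpoint, for illustration only.
print(list_records_url("https://repository.example.org/oai", "rioxx"))
```

Opening the resulting URL in a browser is a quick way to check that an endpoint answers for a given prefix.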
Direct link to the fulltext (Recommended)
<dc:identifier> contains an unambiguous URL to the full text
Explanation:
dc:identifier must contain one URL which links directly to the fulltext. Resolvers and registries such as handle.net and doi.org are accepted.
CORE supports an unlimited number of dc:identifier fields and expects one of them to contain a direct link to the full text. Any limit on the number of fields comes from the application profile. For example, RIOXX allows only one dc:identifier field per item.
EPrints follows this recommendation by default. It usually looks like this:
<dc:identifier>http://oro.open.ac.uk/37823/1/jcdl2013_v7.pdf</dc:identifier>
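To check a record against this recommendation, a repository manager could list the dc:identifier values that look like direct file links. The sketch below is a rough heuristic under stated assumptions: it only inspects the file extension in the URL path, so it will not recognise resolver links such as handle.net or doi.org, and it is not CORE's own detection logic. The function name and the sample record are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlparse

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: a direct link ends in one of the supported fulltext extensions.
FULLTEXT_EXTENSIONS = (".pdf", ".doc", ".docx")


def direct_fulltext_links(record_xml: str) -> List[str]:
    """Return dc:identifier values whose URL path ends in a fulltext extension."""
    root = ET.fromstring(record_xml)
    links = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        path = urlparse(value).path.lower()
        if path.endswith(FULLTEXT_EXTENSIONS):
            links.append(value)
    return links


# Illustrative record fragment built from the example URLs in this document.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://oro.open.ac.uk/37823/</dc:identifier>
  <dc:identifier>http://oro.open.ac.uk/37823/1/jcdl2013_v7.pdf</dc:identifier>
</oai_dc:dc>"""

print(direct_fulltext_links(SAMPLE))
```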
Indirect link to the fulltext (Supported)
<dc:identifier> contains a URL to a metadata page which provides a “download pdf” URL
Explanation:
Some repository software (e.g. DSpace) inserts a link to a metadata page into dc:identifier:
<dc:identifier>http://oro.open.ac.uk/37823/</dc:identifier>
We are not always able to identify the fulltext using this setup.
If we are able to find a fulltext, we run a process called “Title Matching” to ensure we have found the correct item. This compares the title in the metadata with the title in the fulltext. This process is not 100% accurate, causes extra load on the server and requires more processing time per document. Scanned documents are impossible to harvest when OCR has not been performed.
Following our “Direct link to the fulltext” recommendation avoids this situation. We are even able to harvest scanned full texts.
Resolvers and registries such as handle.net and doi.org are supported in this scenario.
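The idea behind “Title Matching” can be sketched as a fuzzy comparison between two normalised strings. The example below is only an illustration of the concept, not CORE's actual algorithm: the normalisation steps, the use of difflib and the 0.9 threshold are all assumptions made for this sketch.

```python
import re
from difflib import SequenceMatcher


def normalise_title(title: str) -> str:
    """Lower-case the title, replace punctuation with spaces and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", title.lower())
    return " ".join(cleaned.split())


def titles_match(metadata_title: str, fulltext_title: str, threshold: float = 0.9) -> bool:
    """Return True when the two titles are similar enough (threshold is an assumption)."""
    ratio = SequenceMatcher(
        None, normalise_title(metadata_title), normalise_title(fulltext_title)
    ).ratio()
    return ratio >= threshold


# Titles invented for illustration.
print(titles_match("Open Access: A Survey", "OPEN ACCESS - a survey"))
```

A comparison like this needs text extracted from the document, which is why it cannot work on a scan without OCR.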
Fulltexts
Fulltext format (Required)
The fulltext is uploaded as a pdf, doc or docx
Explanation:
CORE only supports valid pdf, doc or docx files that contain extractable text.
If a pdf is from a scan, ensure that OCR has been applied so that the file contains a text layer.
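A quick sanity check before uploading is to confirm that a file's leading bytes match the expected container format. The sketch below is a minimal example under stated assumptions: it only inspects the file signature, so it cannot tell whether a file is fully valid or whether it contains extractable text, and the ZIP and OLE2 signatures are shared with other formats. The function name is illustrative.

```python
from typing import Optional


def sniff_fulltext_format(data: bytes) -> Optional[str]:
    """Guess pdf, doc or docx from the leading bytes of a file; None if unknown."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        # OLE2 compound file, the container used by legacy .doc
        # (also used by other legacy Office formats).
        return "doc"
    if data.startswith(b"PK\x03\x04"):
        # ZIP container, used by .docx (and by many other formats).
        return "docx"
    return None


print(sniff_fulltext_format(b"%PDF-1.7\n"))
```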
Fulltext links are hosted on the same domain (Required)
The fulltext is hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains.
Explanation:
To ensure we harvest the correct content, fulltexts must be hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains.
Exceptions are made where the domain is owned by the same organisation.
We are unable to harvest fulltexts hosted on shared services such as Dropbox, Google Drive or OneDrive.
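The same-domain rule can be approximated with a host comparison. The sketch below assumes the repository's base domain is supplied explicitly, because deriving it reliably from a URL requires the Public Suffix List. The function names and the example.org URLs are hypothetical, and the check does not model the exception for domains owned by the same organisation.

```python
from urllib.parse import urlparse


def is_under_domain(url: str, domain: str) -> bool:
    """True when the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def fulltext_on_repository_domain(fulltext_url: str, endpoint_url: str, domain: str) -> bool:
    """True when both the fulltext and the OAI-PMH endpoint sit under the given domain."""
    return is_under_domain(fulltext_url, domain) and is_under_domain(endpoint_url, domain)


# Hypothetical URLs, for illustration only.
print(fulltext_on_repository_domain(
    "https://files.example.org/123/paper.pdf",
    "https://repository.example.org/oai",
    "example.org",
))
```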