Introduction

CORE harvests many different data providers across the world. Local laws, regulations, recommendations and research practices make harvesting content difficult. There are many standards that normalise how data is accessed, but they are often interpreted in different ways. This document describes how we recommend repositories are configured.

CORE aims to support local guidelines and practices, particularly RIOXX and OpenAIRE. If you find any recommendations that conflict with your own local policies, please contact us and we will try to assist by changing our policies to meet local needs.


UK Only

For UK institutions participating in the Research Excellence Framework 2021, we recommend reading through our REF Support Guidelines. If you find any ambiguity between this document and our General Harvesting Guidelines, the REF Support Guidelines take priority.


How to understand this document (Recommended)

The guideline which we recommend you follow

Explanation:

Each guideline follows the same pattern: a title, the status, the guideline and an explanation of its meaning.

A guideline status can be one of the following:

Required

If this guideline is not followed, we will be unable to harvest your repository


Recommended

Following this recommendation will yield the best results for CORE to harvest your content


Supported

We will still be able to harvest your repository, but the result may not be optimal: we may miss content or take longer to harvest


If you have any questions, please contact us and we will be happy to assist.


Metadata

Supported Protocols (Required)

A data provider must support OAI-PMH or ResourceSync

Explanation:


For CORE to harvest a data provider, the provider must have an OAI-PMH or ResourceSync compliant endpoint. Most common repository software, such as EPrints, DSpace or Open Journal Systems (OJS), supports OAI-PMH.

For OAI-PMH, we support the following metadataPrefix values:

  • oai_dc

  • rioxx

  • qdc

If you create a customised metadataPrefix which follows our guidelines, we recommend calling it dc_core.
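As an illustration of how these prefixes are used, the following minimal Python sketch builds an OAI-PMH ListRecords request URL. The function name list_records_url is hypothetical, the endpoint is the fictional one used later in this document, and the allowed-prefix check is only an assumption about how a client might enforce the list above; it is not a description of CORE's harvester.

```python
from urllib.parse import urlencode

# metadataPrefix values listed in the guideline above.
SUPPORTED_PREFIXES = ("oai_dc", "rioxx", "qdc")


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    if metadata_prefix not in SUPPORTED_PREFIXES:
        raise ValueError(f"unsupported metadataPrefix: {metadata_prefix}")
    # "verb" and "metadataPrefix" are standard OAI-PMH request arguments.
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Fictional endpoint, used for illustration only.
    print(list_records_url("https://weloveopenaccess.org/cgi/oai2", "rioxx"))
```

Opening the resulting URL in a browser is a quick way to confirm that an endpoint responds for a given metadataPrefix.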


<dc:identifier> contains an unambiguous URL to the full text

Explanation:


dc:identifier must contain one URL which links to the fulltext. Resolvers and registries such as handle.net and doi.org are accepted.


CORE supports an unlimited number of dc:identifier fields. We expect one of these fields to contain a direct link to the full text. The only limit is the one imposed by the application profile. For example, RIOXX allows only one dc:identifier field per item.


EPrints follows this recommendation by default. It usually looks like this:


<dc:identifier>http://oro.open.ac.uk/37823/1/jcdl2013_v7.pdf</dc:identifier>
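To check a record against this recommendation, a repository manager could scan its dc:identifier fields for a direct link. The sketch below is one possible check, not CORE's own logic: the function name direct_fulltext_links and the file-extension heuristic are assumptions, and links that go through resolvers such as handle.net or doi.org would not be recognised by it.

```python
import xml.etree.ElementTree as ET
from typing import List
from urllib.parse import urlsplit

DC_NS = "http://purl.org/dc/elements/1.1/"
# Assumption: a direct link ends in one of the supported fulltext extensions.
FULLTEXT_EXTENSIONS = (".pdf", ".doc", ".docx")

SAMPLE_RECORD = """<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:identifier>http://oro.open.ac.uk/37823/</dc:identifier>
  <dc:identifier>http://oro.open.ac.uk/37823/1/jcdl2013_v7.pdf</dc:identifier>
</oai_dc:dc>"""


def direct_fulltext_links(record_xml: str) -> List[str]:
    """Return the dc:identifier values that look like direct fulltext URLs."""
    root = ET.fromstring(record_xml)
    links = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        is_http = value.lower().startswith(("http://", "https://"))
        path = urlsplit(value).path.lower()
        if is_http and path.endswith(FULLTEXT_EXTENSIONS):
            links.append(value)
    return links


if __name__ == "__main__":
    print(direct_fulltext_links(SAMPLE_RECORD))
```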

<dc:identifier> contains a URL to a metadata page which provides a “download pdf” URL

Explanation:

Some repository software (e.g. DSpace) inserts the URL of a metadata page into dc:identifier:

<dc:identifier>http://oro.open.ac.uk/37823/</dc:identifier>

We are not always able to identify the fulltext using this setup.


If we are able to find a fulltext, we run a process called “Title Matching” to ensure we have found the correct item. This compares the title in the metadata with the title in the fulltext. This process is not perfect, causes extra load on the server and requires more processing time per document. Scanned documents are impossible to harvest this way when OCR has not been performed.


Following our “Direct link to the fulltext” recommendation avoids this situation. We are even able to harvest scanned full texts.


Resolvers and registries such as handle.net and doi.org are supported in this scenario.
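The idea behind “Title Matching” can be illustrated with a very simple comparison. This document does not describe CORE's actual algorithm, so the sketch below is only an assumption about what such a check might look like: it normalises both strings and tests whether the metadata title occurs in the extracted text. The function names and the sample title are hypothetical.

```python
import re


def normalise(text: str) -> str:
    """Lowercase, replace runs of non-alphanumeric characters with one space, trim."""
    # Note: this simple pattern also drops non-ASCII letters.
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def title_matches(metadata_title: str, fulltext: str) -> bool:
    """Return True if the normalised title occurs in the normalised full text."""
    title = normalise(metadata_title)
    return bool(title) and title in normalise(fulltext)


if __name__ == "__main__":
    # Hypothetical title and first-page text, for illustration only.
    page = "An Example Study:\nof   Repository Harvesting\nA. Author"
    print(title_matches("An example study of repository harvesting", page))
    # A scanned PDF without OCR yields no text, so nothing can match.
    print(title_matches("An example study of repository harvesting", ""))
```

The second call shows why scanned documents without OCR fail in this setup: with no extractable text, there is nothing to compare the title against.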


Fulltexts

Fulltext format (Required)

The fulltext is uploaded as a pdf, doc or docx

Explanation:


CORE only supports valid pdf, doc or docx files that contain extractable text.


If a pdf was produced from a scan, ensure that OCR has been applied so that the file contains extractable text.
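Before uploading, a file's format can be sanity-checked from its leading bytes. The following sketch is an illustration under stated assumptions, not part of CORE's pipeline: the function name sniff_format is hypothetical, the legacy Office signature is shared by other file types besides doc, and the check says nothing about whether the file contains extractable text.

```python
import io
import zipfile
from typing import Optional

PDF_MAGIC = b"%PDF-"
ZIP_MAGIC = b"PK\x03\x04"  # docx files are ZIP containers
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Office container (doc)


def sniff_format(data: bytes) -> Optional[str]:
    """Guess 'pdf', 'doc' or 'docx' from file contents, or return None."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if data.startswith(OLE_MAGIC):
        # Other legacy Office files (e.g. xls) share this signature.
        return "doc"
    if data.startswith(ZIP_MAGIC):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                # Assumption: a docx stores its body at the conventional path.
                if "word/document.xml" in archive.namelist():
                    return "docx"
        except zipfile.BadZipFile:
            return None
    return None


if __name__ == "__main__":
    print(sniff_format(b"%PDF-1.7\n"))
    print(sniff_format(b"<html></html>"))
```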


The fulltext is hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains

Explanation:


To ensure we harvest the correct content, fulltexts must be hosted on the same domain as the OAI-PMH endpoint, or on one of its subdomains.


Exceptions are made where the domain is owned by the same organisation.


We are unable to harvest fulltexts hosted on shared services such as Dropbox, Google Drive or OneDrive.
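The same-domain rule can be expressed as a small check. This is a minimal sketch under assumptions, not CORE's implementation: the function name hosted_with_endpoint and the file paths are hypothetical, the endpoint's hostname is treated as the base domain, and the exceptions for domains owned by the same organisation are not modelled.

```python
from urllib.parse import urlsplit


def hosted_with_endpoint(fulltext_url: str, endpoint_url: str) -> bool:
    """Return True if the fulltext host equals the endpoint host or is a subdomain of it."""
    file_host = (urlsplit(fulltext_url).hostname or "").lower()
    endpoint_host = (urlsplit(endpoint_url).hostname or "").lower()
    if not file_host or not endpoint_host:
        return False
    # The leading dot prevents "notexample.org" from matching "example.org".
    return file_host == endpoint_host or file_host.endswith("." + endpoint_host)


if __name__ == "__main__":
    endpoint = "https://weloveopenaccess.org/cgi/oai2"
    # Hypothetical file locations, for illustration only.
    for url in (
        "https://weloveopenaccess.org/files/paper.pdf",
        "https://files.weloveopenaccess.org/paper.pdf",
        "https://www.dropbox.com/s/abc/paper.pdf",
    ):
        print(url, hosted_with_endpoint(url, endpoint))
```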

 

Examples:

The fictional organisation “We Love Open Access” has an OAI-PMH endpoint at https://weloveopenaccess.org/cgi/oai2.

By default, we will accept any papers hosted like this:

Notice how the files and the OAI-PMH endpoint are hosted on the same domain (weloveopenaccess.org).

We can accept the following setup, but please contact us to allow a specific exception:

We may also accept hosting where there is a clear link between the owner of the OAI-PMH endpoint and the hosting website:

 

However, we are not able to support the following hosting providers: