Discover, reuse, and cite data

Last changed: 17 November 2020

Reusing data holds great promise, a fact acknowledged by many funders worldwide, who now require applicants to check whether the data they intend to collect or generate already exists. This page is designed to help you discover third-party data and explains what to look out for when reusing such data in your own research.

Reusing data is a process of several distinct – and often cyclic – steps. Before you search data resources for secondary or third-party data, you should develop a clear picture of the data you need and locate the appropriate places to look for it. Once you have carried out a data discovery search and found candidate data, select the data that best suits your needs. As a final step, evaluate the quality of the selected data.

How and where to discover data

How to discover data

It is important to know exactly what data you are looking for. Ask yourself: “What data fits my intentions?” Listing the characteristics of the data you want to discover makes it easier to

  • formulate the right search terms to find sources which hold such data and

  • search the data source of choice for adequate data.

Where to discover data

Once you have developed a clear picture of the data you are looking for, you will need to locate appropriate data resources that may host such data and find out how to search in that data resource of choice. There are many ways to discover, search, and find data. Below you will find some examples as a start.

Data from SLU

A selection of datasets collected, generated, and processed/analysed by SLU employees can be found at:

Data repositories

There are both general, interdisciplinary and discipline-specific repositories of deposited data. Listed below are some of the major interdisciplinary as well as discipline-specific data repositories of targeted interest to SLU employees.

Interdisciplinary data repositories and resources
  • SND National Research Data Catalogue – run by a consortium of Swedish universities including SLU and funded by the Swedish Research Council; holds data, metadata, or both

  • Zenodo – built and developed by researchers as part of the EU project OpenAIRE; allows publishing of all types of research artifacts (including data and metadata)

  • DRYAD – run by a community-led, non-profit organisation; hosts a wide variety of data

  • Harvard Dataverse – developed by Harvard University; hosts both data and code from researchers all over the world

  • Figshare – a commercial product, where anyone can upload all types of data including research data

Discipline-specific repositories and resources
  • EDI (Environmental Data Initiative) – a National Science Foundation funded project, promoting and enabling curation and reuse of environmental data

  • EMBL-EBI (European Molecular Biology Laboratory-European Bioinformatics Institute) – provides a repository for publishing data sources in the life sciences

  • DataONE – a community-driven programme, funded by the National Science Foundation, that provides access to earth and environmental data

  • GBIF (Global Biodiversity Information Facility) – a global intergovernmental initiative aiming to advance free and open access to biodiversity data

  • ICOS Carbon Portal – developed by Lund University, offering free access to high-quality and standardised greenhouse gas data

  • PANGAEA – a data repository within the earth and environmental sciences, hosted by the Alfred-Wegener-Institute and the University of Bremen and supported by the EU

  • SITES (Swedish Infrastructure for Ecosystem Science) – a national data portal for field-based ecosystem science that hosts data on basic meteorological and hydrological parameters as well as broader biological and chemical data

  • World Data Centre for Soils – hosted by the International Soil Reference and Information Centre

For identifying additional data repositories – both general and discipline-specific – you can search one of the following:

Data portals

Data portals harvest data from a large number of data providers, allowing you to search for data across several repositories and data sources simultaneously.

  • OpenAIRE – data portal to search datasets from studies funded by the European Commission

  • EU European Data Portal – a single point of access to open data from EU institutions, agencies, and bodies

  • Dataportal.se – provides access to open public sector information from different Swedish authorities, agencies, and institutions including the Swedish Environmental Protection Agency as well as SLU

  • DataCite – provides a service for searching among datasets that have been assigned a DataCite DOI, regardless of origin and subject (DataCite is a DOI registration agency for research data and other research outputs)
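A DataCite search like the one described above can also be run programmatically. The following is a minimal sketch that only builds the search URL. It assumes the public DataCite REST API endpoint `https://api.datacite.org/dois` and its `query`, `resource-type-id`, and `page[size]` parameters; the function name and the search terms are illustrative and not taken from this page.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public DataCite REST API.
DATACITE_API = "https://api.datacite.org/dois"


def build_datacite_search_url(search_terms, page_size=10):
    """Return a URL that searches DataCite for datasets matching the terms."""
    params = {
        "query": " ".join(search_terms),   # free-text search terms
        "resource-type-id": "dataset",     # restrict results to datasets
        "page[size]": page_size,           # number of hits per page
    }
    return f"{DATACITE_API}?{urlencode(params)}"


# Illustrative search terms derived from the characteristics of the data you need.
url = build_datacite_search_url(["phenology", "Sweden"])
print(url)
```

Opening the printed URL in a browser, or fetching it with an HTTP client, should return matching records as JSON. Keeping the URL alongside your notes also records the search location and query for later documentation.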

Scientific literature and bibliographic databases

Scientific journals increasingly require or explicitly encourage authors to make the data underlying their articles openly available. There are a number of ways to do so. Often, only part of the data underpinning a publication is shared, published as an appendix (or as supplementary information) to the publication itself (see the European Commission’s web page on ‘Facts and Figures for open research data’ for more detailed insight). Such data is, however, only openly available if the article itself is published open access (i.e., many datasets published as supplementary information are not openly available but locked behind paywalls). Alternatively, a published article may contain a statement about where its underlying data is stored and how access may be gained. At times, data is also searchable together with its respective publication in bibliographic databases (e.g., through data citation), such as Web of Science or the SLU University Library’s search tool.

Data journals/data papers

A data paper is an article focusing on facts about a dataset (data collection, access, features, potential use, etc.) rather than on its analysis and the surrounding research. Data papers tell you what type of data was collected or generated, and they describe how that data may be accessed and reused. Just as with data repositories, there are data journals that cover all kinds of research fields and others that are specific to certain disciplines. Some journals, furthermore, specialise in publishing only articles about datasets of particular interest. Some examples of such data journals are

Register data

Registerforskning.se is operated by the Swedish Research Council and provides researchers with information on existing registers, as well as support during the process of register-based research (e.g., information on each part of the process of identifying, requesting, and using register data).

Google Dataset Search

Much data is not formally published in a repository or portal. A general web search may reveal project websites, publications pointing to data, and contact details of the primary researcher. Increasingly, datasets are being shared and published with associated links to grants, software, etc. Besides finding data deposited in data repositories or listed in data portals, Google's Dataset Search can also help you discover datasets available via the abovementioned, rather unconventional and informal sources.

How to reuse data

Once you have found secondary data that fits your intended reuse, you first need to check whether you are allowed to use it, how you may reuse it, and how to access it. As a second step, it is highly recommended to evaluate the quality of the remaining data. Ask yourself what level of quality the dataset must have to be useful for your purposes. To select third-party data appropriate to your needs, you need to get to know and understand that data properly.

Responsibilities and rights

When using third-party data, you have a responsibility to respect the rights that may be held by other people or organisations (including copyright, sui generis database rights, and ethical/moral rights). Normally, data is shared and published under a licence or via a waiver. A licence generally defines the circumstances under which data may be used, how the data used shall be attributed, and how newly generated data building on the original author’s work shall be shared (i.e., under what licence). A waiver, on the other hand, is a legal document for giving up one’s rights to the data (read more about licensing research data on the Digital Curation Centre’s web page). Thus, check the terms and conditions of access and use, make sure the licence applied by the author, organisation, authority, etc. is suitable for your purposes, and obtain any permission or consent that may be needed.

Assessing secondary data before reuse

Once you have found secondary data that fits your purposes and all legal as well as ethical rights have been cleared, it is time to closely evaluate it. Evaluating or assessing secondary data is much like evaluating the quality of a research paper. Consider factors that relate to the data’s reliability, validity, and quality, such as:

  • Is the source of the data clearly stated? Can it be trusted?

  • Who is hosting the data? Is the data available in a sustainable repository?

  • Why was the data collected/generated?

  • Who collected/generated the data, when and how?

  • How was the data processed (does documentation exist)?

  • Is that data ‘clean’ (i.e., were non-logical and erroneous values deleted)?

  • What quality assurance procedures were used?

  • Is the data otherwise well documented?

  • Is the data well described with regard to its context (important for understanding the data)?

  • Does the data come with a permanent identifier that can be used for referencing the data? (see also Data citation below)

  • Do contact details exist in case further information is required?

It is advisable to assess the secondary data’s quality before actually downloading it. Taking a closer look at the documentation and metadata accompanying the data you intend to reuse (if any exists) should give you some insight into the data’s quality.

Documenting the reuse of secondary data

It is highly important to provide documentation when reusing secondary data. Make certain to address the following questions when documenting the reuse of secondary data: What data has been reused? How was the data obtained (search location plus search query applied)? How was the data evaluated? Was it pre-processed prior to analysis? How was the data used within the new study? You should keep sufficiently detailed documentation about data and methods to enable other researchers to locate the original data and reproduce/validate your findings.
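The questions above can be kept as a small structured record stored alongside your analysis. The following is a minimal sketch, not a prescribed SLU format: the class name, field names, and example values are illustrative assumptions, with one field per question.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class ReuseRecord:
    """One entry per reused dataset; the field names are illustrative."""
    dataset: str          # what data has been reused (title and identifier)
    search_location: str  # where the data was found
    search_query: str     # search query applied at that location
    evaluation: str       # how the data was evaluated
    preprocessing: str    # pre-processing prior to analysis ("none" if not applicable)
    use_in_study: str     # how the data was used within the new study

    def missing_fields(self):
        """Return the names of questions that have not been answered yet."""
        return [name for name, value in asdict(self).items() if not value.strip()]


# Hypothetical example values; only the dataset title and catalogue name appear on this page.
record = ReuseRecord(
    dataset="Swedish historical phenology dataset, Version 1",
    search_location="SND National Research Data Catalogue",
    search_query="phenology",
    evaluation="",  # still to be documented
    preprocessing="none",
    use_in_study="Comparison with own field observations",
)

print(record.missing_fields())
print(json.dumps(asdict(record), indent=2, ensure_ascii=False))
```

Serialising the record to JSON makes it easy to deposit the documentation together with the results, so that other researchers can locate the original data and retrace the steps.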

How to cite data

When reusing or referring to data, you should cite the dataset just as you cite a scientific article. This is best practice and typically a condition of the licence under which the data used was shared. Citing data is recognised as one of the key practices leading to recognition of data as a primary output in its own right. FORCE11 provides a set of guiding principles when it comes to citing data.

Styles and formats for citing data vary in the same way article citation styles and formats vary. In any case, a standard data citation should include the following elements:

Creators (Year of Publication): Title, Version. Publisher. Type of Resource. Principal Organisation. Identifier.

Example: Langvall, O. and Dahl, Å. (2019). Swedish historical phenology dataset. Version 1. Swedish National Data Service. Available from https://snd.gu.se/en/catalogue/study/snd1105/001 [Accessed 04 June 2020]
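The element order given above can be turned into a small helper that assembles a citation string. This is a sketch under stated assumptions: the function name and parameters are hypothetical, optional elements that are not supplied are simply skipped, and the punctuation follows the generic template rather than any particular citation style.

```python
def format_data_citation(creators, year, title, publisher, identifier,
                         version=None, resource_type=None, organisation=None):
    """Assemble: Creators (Year): Title, Version. Publisher. Type. Organisation. Identifier."""
    head = f"{' and '.join(creators)} ({year}): {title}"
    if version:
        head += f", {version}"
    # Keep the template order and drop optional elements that were not supplied.
    parts = [head, publisher, resource_type, organisation, identifier]
    return ". ".join(part for part in parts if part) + "."


citation = format_data_citation(
    creators=["Langvall, O.", "Dahl, Å."],
    year=2019,
    title="Swedish historical phenology dataset",
    version="Version 1",
    publisher="Swedish National Data Service",
    identifier="https://snd.gu.se/en/catalogue/study/snd1105/001",
)
print(citation)
```

Journals and repositories may prescribe their own style, so treat the output as a starting point and adjust it to the style required.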

Should you have difficulties in properly citing data, you can use the DOI citation formatter to automatically extract metadata from a DOI and build a full citation in various citation styles.
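A similar lookup can be done in a script through DOI content negotiation, which the DOI resolver at `https://doi.org` supports for DataCite and Crossref DOIs. The sketch below assumes the `text/x-bibliography` media type with a `style` parameter; the function names are hypothetical and the DOI shown is a placeholder, not a real dataset.

```python
from urllib.request import Request, urlopen


def build_citation_request(doi, style="apa"):
    """Prepare a request asking the DOI resolver for a formatted citation."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
    )


def fetch_citation(doi, style="apa"):
    """Send the request and return the citation text (requires network access)."""
    with urlopen(build_citation_request(doi, style), timeout=30) as response:
        return response.read().decode("utf-8").strip()


# Placeholder DOI for illustration only; replace it with the DOI of the dataset you reuse.
request = build_citation_request("10.1234/example-doi")
print(request.full_url)
print(request.get_header("Accept"))
```

Calling `fetch_citation` with a real dataset DOI should return a citation in the requested style. Check the result against the dataset's landing page, since the output depends on the metadata registered for the DOI.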

The Digital Curation Centre in the UK provides more detailed information about citing datasets.


Contact
Page editor: dcu@slu.se