Collect, organise, and store data

Last changed: 17 November 2020
[Image: Collect, organise, store data (Flaticon)]

The data you collect, generate, and/or acquire is the key building material of your research. Organising, documenting, and describing that data in a systematic manner right from the start saves both time and energy, and, more importantly, has a real impact on the data’s quality. In addition, storing the data properly and securely will impact its integrity and further maximise its value.

Collecting new data and/or reusing existing data

When designing and planning a research project, there are a number of things that you need to take into consideration with regard to collecting, generating, and/or acquiring data. Addressing these issues early on in your project can save you time and effort later.

Firstly, depending on your research question, you may be working with primary data, reusing secondary data, or both. Primary data is data that is collected and/or generated as part of your own research project. Secondary or third-party data, on the other hand, is data that already exists and that has been collected and/or generated as part of someone else’s research project or activity (go to Discover, reuse, and cite data for advice on how and where to find and reuse already existing data). Whichever applies to you, note that collecting and generating new data is a far more costly activity than reusing third-party data. As a consequence, and in an effort to minimise unnecessary costs, funders nowadays require applicants to check whether or not the data they intend to collect and/or generate already exists. Hence, think carefully about what data to collect, generate, and/or reuse.

Secondly, it is important to consider both legal and ethical rights with regard to the data you intend to collect, generate, and/or acquire. For instance, should you, as part of your research project, be collecting data containing personal or sensitive information, you need to ensure that you comply with data protection legislation and ethical guidelines (see section ‘Information security and data protection’ below for more information in this respect). When it comes to reusing already existing data, you will need to make sure you have permission to actually reuse such data. Make certain you address all ethical, copyright, and rights management issues (e.g., intellectual property rights) that may apply to the data you will be working with (Discover, reuse, and cite data provides further information in this regard).

Organising and structuring data

Once you collect, generate, acquire and, eventually, start processing data, it can quickly become disorganised. To save time and prevent errors later on, you should decide on how you will organise and name files. Choosing a logical and consistent system for how to organise and name files allows you and others to locate, identify, and retrieve files quickly and accurately. The best time to think about this is at the start of your project. In short, file management is fundamental to good data management.

Folder/file structure

Establishing a system that allows you to access files, avoid duplication, and ensure that the data can be backed up, takes a bit of planning. A good place to start is to develop a logical, hierarchical folder structure that represents the structure of your research. Doing so helps you sort files into a series of meaningful and useful groups with common properties. Such a folder structure may be organised by, e.g.,

  • project/subproject/activity,

  • experiment/survey,

  • phase (of a project/subproject/activity, etc.),

  • type of data,

  • date/time or date/time range,

  • other categories specific to your project.

An example of a folder structure can be found on the UK Data Service’s web page on organising data.
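
As a rough illustration of the idea above, the following Python sketch creates a small hierarchical folder structure. The folder names (01_raw_data, 02_processed_data, and so on) and the function name are hypothetical placeholders, not a layout prescribed by SLU or the UK Data Service; adapt them to the categories that fit your project.

```python
from pathlib import Path

# Hypothetical layout: subfolders grouped by phase and type of data.
FOLDER_LAYOUT = [
    "01_raw_data/survey",
    "01_raw_data/experiment",
    "02_processed_data",
    "03_analysis/scripts",
    "03_analysis/results",
    "04_documentation",
]


def create_folder_structure(project_root, layout=FOLDER_LAYOUT):
    """Create every folder in `layout` below `project_root` and return the paths."""
    root = Path(project_root)
    created = []
    for relative in layout:
        folder = root / relative
        # parents=True builds intermediate folders; exist_ok=True makes reruns harmless.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling create_folder_structure("ProjA") would create the folders below a ProjA folder in the current working directory. Running it a second time leaves existing folders and their content untouched.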

Folder/file name

For your folder structure to actually become useful, you will need to apply meaningful and appropriate names to both folders and files. Folder names, on the one hand, should concisely convey a folder or subfolder’s content. A file name, on the other hand, is a file’s principal identifier. Good file names provide useful clues to the content, status, and version of a file, uniquely identify a file, and help in classifying and sorting files. Keep in mind that a name should be unique not only among the files within the same folder but also among all the files of the same project. Hence, you should name files in a consistent, logical, meaningful, and predictable manner (in other words: use a file naming convention, that is, a framework for naming files in a way that describes what they contain and how they relate to other files). Elements that should be considered when naming a file are, e.g.,

  • date of creation (use the ISO 8601 format: YYYY-MM-DD),

  • description of content,

  • version number,

  • name of creator,

  • parameters of folder structure.

To avoid problems in managing files in different systems, use POSIX portable filenames: that is, use only the characters A–Z, a–z, 0–9, full stop, underscore, and hyphen (so avoid, e.g., spaces and the Swedish characters ä, ö, and å), and do not start a filename with a hyphen.
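
The naming elements and the character rule above can be combined in a small helper. The following Python sketch is one possible way to do so; the function names and the order of the name elements are assumptions made for illustration, not a convention required by SLU.

```python
import re
from datetime import date

# POSIX portable filename characters: A-Z, a-z, 0-9, full stop, underscore, hyphen.
# The negative lookahead (?!-) rejects names that start with a hyphen.
PORTABLE_NAME = re.compile(r"(?!-)[A-Za-z0-9._-]+")


def is_portable_filename(name):
    """Return True if `name` uses only POSIX portable filename characters."""
    return PORTABLE_NAME.fullmatch(name) is not None


def build_filename(created, project, content, version, extension, creator=None):
    """Assemble a name of the form YYYY-MM-DD_project_content_vNN[_creator].ext."""
    parts = [created.isoformat(), project, content, f"v{version:02d}"]
    if creator:
        parts.append(creator)
    name = "_".join(parts) + "." + extension
    if not is_portable_filename(name):
        raise ValueError(f"not a portable filename: {name!r}")
    return name
```

For instance, build_filename(date(2020, 8, 28), "ProjA", "DMP", 2, "docx", creator="JD") yields 2020-08-28_ProjA_DMP_v02_JD.docx, while a name element containing a space or a character such as ä raises a ValueError.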

See the University of Edinburgh’s 13 file naming rules or the UK Data Service’s best practice on how to name a file.

File format

A file format describes the way information is organised in a digital file. It is important for understanding how a file can be accessed and read and how its data can be used and possibly integrated. A file’s format, furthermore, plays an important role in the data’s archiving and preservation (see Archive and preserve data for further information in this respect). Hence, choosing the appropriate file format is key to ensuring that data is interoperable and reusable. When choosing a file format, you should opt for a consistent format that can be read well into the future and is independent of changes in (software) applications. An appropriate, sustainable format should be

  • open or standard (i.e., non-proprietary and, hence, independent of a specific software),

  • commonly used,

  • unencrypted and well documented (i.e., has an open technical specification), and

  • lossless (i.e., uncompressed, or compressed without loss of information, so that the original detail of the data file is retained).

In practice, it may not always be possible to use formats that fulfil these criteria, especially at the data collection and/or data analysis stages. In such cases, it may be necessary to migrate/convert the data to a more appropriate, sustainable format at a later point in time (i.e., when preparing data for archiving and preservation or sharing). The Swedish National Data Service (SND) provides a list of preferred and acceptable formats for text, spreadsheet data, audio, video files, etc. For more detailed information on file formats see the Australian National Data Service’s Guide.
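
To keep track of which files may still need to be migrated, a simple check against a list of accepted formats can help. The Python sketch below is only an illustration: the set of extensions is a placeholder and does not reproduce SND’s list of preferred and acceptable formats, and the function name is hypothetical.

```python
from pathlib import Path

# Placeholder allow-list: replace it with the formats your archive actually accepts.
PREFERRED_EXTENSIONS = {".csv", ".txt", ".xml"}


def files_needing_conversion(folder, preferred=PREFERRED_EXTENSIONS):
    """Return files below `folder` whose extension is not in `preferred`, sorted by path."""
    return sorted(
        path
        for path in Path(folder).rglob("*")
        if path.is_file() and path.suffix.lower() not in preferred
    )
```

The check only looks at file extensions, which is a coarse proxy for the actual format; it flags candidates for review rather than deciding whether a file is suitable for archiving.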

File versioning

Very few files are drafted by one person in one sitting. More often there will be several people involved in the process and it will occur over an extended period of time. Without proper controls this can quickly lead to confusion as to which version is the most recent. Version control can help you to differentiate between versions of and to track changes within files. There are a number of ways that version control can be managed:

  • file naming – using a revision numbering system (e.g., v01 for the first version, v02 for the second, etc.) to track changes to a file (example: 2020-07-28_ProjA_DMP_v01.docx, 2020-08-28_ProjA_DMP_v02.docx) and/or using initials (e.g., JD for John Doe) to identify who has made the changes (example: 2020-08-28_ProjA_DMP_v02_JD.docx).

  • version control tables – tables included within the document itself that record information such as version, date of change, name of the person who made the change, and the nature and purpose of the change.

  • version control systems – automated systems that monitor access and log changes made to a file. Examples of version control systems are Git and Subversion.

More detailed information about file versioning can be found at UK Data Service.
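
When versions are tracked through file names, the most recent version can be identified programmatically. The following Python sketch assumes the revision marker used in the examples above (an underscore followed by v and a number, e.g., _v02); the function names are hypothetical.

```python
import re

# Matches a revision marker such as "_v02" anywhere in a file name.
VERSION_MARKER = re.compile(r"_v(\d+)")


def version_of(filename):
    """Return the revision number encoded in `filename`, or None if there is none."""
    match = VERSION_MARKER.search(filename)
    return int(match.group(1)) if match else None


def latest_version(filenames):
    """Return the file name with the highest revision number, or None."""
    versioned = [name for name in filenames if version_of(name) is not None]
    return max(versioned, key=version_of, default=None)
```

Because the revision number is compared as an integer, v10 is correctly ranked above v09. Files without a revision marker are ignored.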

A good idea – especially when working in collaboration with others – is to document your file management system (i.e., folder structure, file and folder naming convention, file versioning, and choice of file format) in a supporting ReadMe file. It is recommended to place such a ReadMe file within the top-level folder of your project, where it can be found easily by everyone involved. Note: make sure to update the ReadMe file upon changes to the file management system.

Documenting data

Data, if documented and annotated correctly, will have significant ongoing value and can continue to have an impact long after your research project has been completed. Thoroughly and accurately describing and attributing the data will help you, your future self, your collaborators, and others find, understand, validate, and/or reuse it. Our advice is to start describing and annotating the data early on, while the relevant information is still at hand.

Describing and annotating data generates structured contextual information, so-called metadata. Metadata is information about the data and is one of the most important aspects of research data management. Metadata explains the context behind the structure and content of the data. It is about describing and characterising elements of the data itself and should answer questions such as ‘Why was the data collected/generated/acquired?’, ‘Who collected/generated the data?’, ‘Where was the data acquired from and according to what license?’, ‘Where, when, and how was the data collected/generated/acquired?’, ‘What is the content of the data?’, ‘How was the data assessed/evaluated?’, etc.

Data description at different levels

Ideally, you should describe and annotate the data at different levels:

  • project level – include information regarding the aim and research question of a research project; document what data is collected, generated, and/or acquired and how; provide details on the type (e.g., observation, experiment) and nature (e.g., numerical, textual, audio, video) of the data; describe the methodologies of data collection as well as any abbreviations used

  • file/dataset level – include content, title, creator, date, format, version, data structure, and file relations in your documentation; describe tools and abbreviations

  • data item/variable level – describe variables and values, field names, units of measurement, and classifications; document code written and abbreviations used.

There are different ways in which you can document data depending on the context within which it is being collected, generated and/or acquired. Certain metadata about a file or data may be embedded within the data or document itself (some file formats can record information in addition to the main data content, e.g., XML), while some may be part of a separate, supporting document, such as a “ReadMe” text file (check out an example ReadMe file from Bath University Library). See Process and analyse data for ways of documenting not only data but also changes made and analyses run during the active stages of data processing and analysis. In any case, make certain to update such corresponding metadata or files upon changes to the respective file or data.
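
As an illustration of a separate, supporting metadata document at the file/dataset level, the Python sketch below writes a small JSON file next to a data file. The field names and the “.metadata.json” suffix are assumptions chosen for this example and do not follow any particular metadata standard; where a community standard exists, its elements should be used instead.

```python
import json
from pathlib import Path


def write_metadata_sidecar(data_file, title, creator, created, version, description):
    """Write file-level metadata to '<data file name>.metadata.json' next to the data file."""
    data_path = Path(data_file)
    metadata = {
        "file": data_path.name,
        "title": title,
        "creator": creator,
        "date": created,  # ISO 8601 string, e.g. "2020-08-28"
        "format": data_path.suffix.lstrip("."),
        "version": version,
        "description": description,
    }
    sidecar = data_path.with_name(data_path.name + ".metadata.json")
    # ensure_ascii=False keeps characters such as å, ä, and ö readable in the file.
    sidecar.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")
    return sidecar
```

Keeping the metadata in a structured, plain-text file means that it can be read both by people and by software, and it can be updated whenever the data file changes.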

Metadata standards

In practice, creating metadata means describing and characterising elements of your work and of the data itself. However, finding out which elements you should use and when can be a daunting task. One way of creating metadata for your work is to adopt a metadata standard. Applying a metadata standard when describing and annotating the data makes it easier to find the data and to combine and/or compare it with data from different research projects. These standards can, furthermore, be helpful in deciding what data to collect, what vocabulary to use, what units of measurement to apply, etc. Making use of a metadata standard allows for a more structured description and annotation of the data and, as such, enables the metadata to be machine-readable. Thus, try to use terms from a standard vocabulary or ontology as you describe data. Linked Open Vocabularies, Agroportal ontologies, or FAIRsharing.org are examples of where you may be able to find suitable terms including definitions.

In many research communities there are agreed-upon vocabularies and models for how to describe data. Make sure to use a community metadata standard where one exists. For more information on discipline- or community-specific metadata standards as well as links to such, please visit this web page maintained by the Digital Curation Centre (DCC).

FAIR data and metadata

Structured and standardised (and thus machine-readable) metadata is important for archiving and preserving data in a FAIR manner. FAIR in relation to data means that data is Findable, Accessible, Interoperable, and Reusable. A set of FAIR guiding principles for scientific data management and stewardship has been developed and has since been widely endorsed by governments, funders, journals, publishers, and research communities in order to maximise the value and utility of research data. Read more about FAIR data and metadata at the GO FAIR web site.

Storing and backing up data

“Good data storage and backup practice begins with planning for disaster” (anonymous source).

Data storage

Without data, no research! It is, therefore, essential that you take adequate measures regarding data storage. Choosing the right way to store data can help you work more flexibly, easily, and quickly. Good data storage practices can prevent loss of data, simplify version control, and enable effective collaboration. See the UK Data Service for how to best store research data.

As an employee at SLU, you can choose among a range of different storage options. When deciding how and where to store data, you need to consider the following factors: nature of the data (numerical, descriptive, aural, visual, etc.), type of data (e.g., confidential, sensitive, and personal data), volume of the data, and data access needs. Note that SLU personnel are allowed to use cloud services offered through Office365 (including OneDrive and Azure) for storing data, given that legal, ethical and other requirements are satisfied (see SLU’s document “Molntjänster på SLU” at SLU’s “Public 360”; information available only in Swedish). Please contact your department’s IT coordinator or SLU’s IT department to help you choose the right data storage plan.

Data backup

The best way to keep research data safe is to consistently back it up. Backups are an important instrument to ensure that data and related files can be restored if required. Backups can protect the data from hardware and system failures, software faults, thefts, and more. You will need to think about what to back up, in what format, how many copies to keep, and how long to keep them for. In addition, make sure the data is stored in places that have routines for backup and security in place. Check out DataONE’s advice on how to best back up data.
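
A backup is only useful if the copy can actually be restored intact. One common way to check this, which is not prescribed by the guidance above, is to compare checksums of the original file and its backup copy. The following Python sketch uses SHA-256 from the standard library; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Reading in chunks keeps memory use low, even for large data files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original, backup):
    """Return True if the backup copy has the same SHA-256 digest as the original."""
    return sha256_of(original) == sha256_of(backup)
```

Storing the checksum alongside the data also makes it possible to detect later on whether a file has been changed or corrupted.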

At SLU, the IT department’s storage media are routinely backed up. Data that you keep in the “Documents” folder of your computer is also backed up routinely (note, however, that only Windows computers are currently included in SLU’s central backup routine). Finally, storing data on local or removable disks (e.g., local hard drive or USB memory stick) should be avoided since these are not covered by SLU’s central backup routine.

When collaborating with researchers outside of SLU, you should consider signing an agreement with all parties involved regarding responsibility for data storage and backup, data access, management of data, and data security. SLU’s Legal Affairs unit can help you with information, advice, and templates in this regard.

Information security and data protection

Information security and data protection are all about complying with data protection, freedom of information, secrecy, and archives legislation. They serve to protect intellectual property rights and commercial interests and to keep personal or sensitive information safe.

When it comes to information security, you must ensure that only authorised people can read, edit, or use the data. Doing so keeps the data safe from unauthorised access and use. In order to decide how to properly safeguard data, you need to classify the information and data according to SLU’s three information security aspects: confidentiality, integrity, and availability (information available only in Swedish). SLU provides further advice on how to keep information and data on computers etc. secure.

Data containing personal or sensitive information needs to be treated with higher levels of protection than data that does not. The EU General Data Protection Regulation (GDPR) recognises the protection of persons in relation to the processing of personal data as a fundamental right. SLU’s Data Protection unit can help you with additional information on data protection and personal data. Note that if you plan to collect and process personal data you need to fill in SLU’s report on the processing of personal data.

Collecting and processing data from human participants is, among other things, subject to their informed consent, which is often sought via a consent form. The form should explain what participation entails, how results and the data will be disseminated, and the impact of the project on participants. Note that gaining informed consent is highly important with regard to future dissemination of data. Also note that, in order to facilitate future data dissemination, it is highly recommended to collect no more personal data than is absolutely necessary. By its very nature, consent information is personal information and should not be openly available. The UK Data Service provides extensive guidance on informed consent. Templates of consent forms can be found on SLU’s Integrity and Data Protection Function’s web page. If, however, you intend to make personal information publicly available with or without prior consent, it is important to take measures to protect such information by means of, for instance, anonymisation or pseudonymisation (see Process and analyse data for more information in this regard).

Public access to information during ongoing research

Material produced as part of research as well as environmental monitoring and assessment activities carried out at SLU is the property of SLU. Because SLU is a public institution, such material becomes official records/documents to which the public is guaranteed access under the Freedom of the Press Act (SFS 1949:105; especially the principle of public access to official documents) (see also the Legal Affairs unit’s web page on Official records, public access, and secrecy in this respect). This means that anyone can request access to such material, even during the active stages of a project (e.g., data collection). Public requests may, however, in some cases be denied; that is, access to official documents may be restricted for secrecy reasons according to the Public Access to Information and Secrecy Act (SFS 2009:400). Applicable reasons to restrict access may be, for instance, that data contains personal or sensitive information or information pertaining to protected species. Working material, on the other hand, such as draft documents, notes, or preliminary processed material, is generally not considered an official document. If shared outside SLU, however, it normally becomes an official document. Read more about official records in research in the manual for managing research material.

Research material that has the status of an official record needs to be managed according to the same principles as other official records; that is, such documents have to be accessible and in good order so that they can be released to the public upon request (see the Archives Act, SFS 1990:782; information only available in Swedish). Such research material has, hence, to be archived and preserved for the future. More information in this regard can be found at Archive and preserve data as well as on the Archives, Information Governance, and Records unit’s Management and preservation web page.

Page editor: dcu@slu.se