Process and analyse data

Last changed: 17 November 2020

Once you are actively processing and analysing data, there are a number of important issues to keep in mind, not only to ensure high data quality and research efficiency but also to enable reproducibility, preservation, and sharing of the data at a later stage.

Raw and processed data

Once you have collected, generated, and/or acquired data as part of your project, it is time to start analysing it. Most often, however, data has to be properly and accurately prepared (i.e., processed) before the actual analysis. Data processing refers to all operations performed on data prior to analysis and may include actions such as verification, organisation, cleaning, transformation, subsetting, and integration.

At this point, it is important to highlight the difference between raw and processed data. Raw data is the original source data: data that has not yet been processed. To guarantee the data’s integrity and security and to avoid data loss, it is crucial to make a copy of the raw data (preferably as a read-only version) and to keep that copy separate from the processed and active data. That copy of the raw data must never be modified. Only the working copy that you actively process becomes processed (or active) data.
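
As an illustration, the sketch below (in Python, which this guide does not prescribe; any language or a manual routine works equally well) copies a hypothetical source file into a read-only raw-data folder and a separate, writable working folder. All file and folder names are assumptions made for the example.

```python
import shutil
from pathlib import Path

raw_dir = Path("data/raw")          # untouched originals live here
work_dir = Path("data/processed")   # all processing happens on copies here
raw_dir.mkdir(parents=True, exist_ok=True)
work_dir.mkdir(parents=True, exist_ok=True)

original = Path("survey_2020.csv")  # hypothetical source file

# Keep one copy as the inviolable raw master and one as the working copy.
raw_copy = raw_dir / original.name
work_copy = work_dir / original.name
shutil.copy2(original, raw_copy)
shutil.copy2(original, work_copy)   # this copy may be processed and modified

# Make the raw master read-only so it cannot be modified by accident.
raw_copy.chmod(0o444)
```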

Finally, once you are done processing and analysing the data, make sure to save a copy of both the raw data and the processed and analysed data in a file format that allows for archiving and preservation. Note that data is often analysed in specific computer programs, which save files in formats readable only by the programs that created them (so-called proprietary formats). It is therefore crucial to use such programs’ export function to save the data in a format suitable for preservation and reuse. More general information on file formats can be found at Collect, organise, and store data, while Archive and preserve data provides guidance on the role of file formats in preserving data.
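
As a minimal illustration of such an export, the sketch below uses the Python library pandas (an assumption of this example, not something this guide mandates) to read a spreadsheet saved in a program-specific format and re-save it as plain CSV, a format that remains readable without the original software. The file names are hypothetical.

```python
import pandas as pd

# Read a spreadsheet saved in a program-specific format
# (reading .xlsx files requires the openpyxl package).
df = pd.read_excel("analysis_results.xlsx")

# Export the data as plain CSV with an explicit encoding, a format
# suitable for preservation and reuse.
df.to_csv("analysis_results.csv", index=False, encoding="utf-8")
```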

Documenting data and version control

As you conduct research, it is vital not only to document the data but also to describe and record its processing and analysis. Producing high-quality documentation of your research ensures that data can be found, understood, reproduced, and validated as well as reused, whether by you, your future self, your collaborators, or others.

Metadata

Documenting and annotating data generates structured information, also known as metadata. Metadata adds value to the data and is critical to effectively manage, use, and reuse data, as it provides information on content, structure, and context. Metadata, moreover, plays a crucial role in making data FAIR (Findable, Accessible, Interoperable, Reusable; read more on FAIR data and metadata on the GO FAIR web site). Lastly, note that many computer programs used for managing, processing, and analysing data support or include built-in functionality to record and document metadata, including the application of metadata standards (e.g., Colectica for Excel, ArcGIS, NVivo). For further information on documenting data and creating metadata, see Collect, organise, and store data.
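
By way of illustration, the sketch below records a small metadata record as a JSON file stored next to the data file it describes. The fields shown are illustrative and loosely modelled on common descriptive elements; in practice you would follow a metadata standard appropriate for your discipline.

```python
import json

# Illustrative descriptive metadata for a hypothetical data file.
metadata = {
    "title": "Field survey of meadow vegetation",
    "creator": "A. Researcher",
    "date_created": "2020-06-15",
    "description": "Species counts from 30 plots; methods in methods.txt.",
    "file": "survey_2020.csv",
    "licence": "CC BY 4.0",
}

# Store the metadata next to the data file it describes.
with open("survey_2020.metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)
```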

Reproducibility

Just as important as creating metadata for the data itself is documenting and recording all the actions and steps taken while creating, processing, and analysing it (i.e., showing your workflow). Showing your workflow means recording and documenting all the changes made and all the various analyses run; showing your work is just as important as showing your results. Documenting and recording the various data creation, processing, and analysis steps enables verification of the data as well as validation and reproducibility of your research results. Note that should you develop computer code (or software) as part of your research activity, you need to regard it as data: it is an essential part of the research process and should therefore be documented, preserved, and shared. The availability of the computer code behind a project (if any) is an enabler of reproducibility. See Jisc’s web page on software for information on how to manage it properly.
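
To make this concrete, the sketch below shows one simple way of recording processing steps as they are performed: a Python script (using the hypothetical files and the pandas library assumed in the earlier examples) that writes each cleaning action to a log file, so the path from raw to analysed data can be retraced.

```python
import logging
import pandas as pd

# Every processing step is appended to a log file with a timestamp.
logging.basicConfig(filename="processing_log.txt", level=logging.INFO,
                    format="%(asctime)s %(message)s")

df = pd.read_csv("data/processed/survey_2020.csv")
logging.info("Loaded survey_2020.csv: %d rows", len(df))

before = len(df)
df = df.dropna(subset=["species"])     # drop rows missing the key field
logging.info("Dropped %d rows lacking 'species'", before - len(df))

df["count"] = df["count"].astype(int)  # enforce a consistent data type
logging.info("Cast 'count' column to integer")

df.to_csv("data/processed/survey_2020_clean.csv", index=False)
logging.info("Wrote survey_2020_clean.csv")
```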

Within the life sciences, widespread digital tools to record and document your workflow are electronic laboratory notebooks (ELNs). Another very useful methodology in this regard is literate programming, which combines source code written in a programming language with rich text elements documenting the computational steps performed (R and Jupyter, for example, provide powerful notebooks). Replication of analyses and reproduction of results can further be achieved using self-contained computing environments (e.g., Docker containers). Such computational environments capture information about which dataset, tool, software (including version and all associated libraries), and operating system were used at the time of processing and analysis (see, for instance, the Code Ocean platform for capturing and sharing computational environments).
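
As a minimal illustration of capturing such environment information without a dedicated platform, the Python sketch below writes the interpreter version and the installed packages with their versions to a file. Dedicated tools (e.g., pip freeze, conda environment files, or container images) do this far more thoroughly; this only shows the idea.

```python
import sys
from importlib import metadata

# Write the Python version and all installed packages (with versions)
# to a file that can be archived alongside the analysis.
with open("environment.txt", "w", encoding="utf-8") as f:
    f.write(f"python {sys.version}\n")
    for dist in sorted(metadata.distributions(),
                       key=lambda d: d.metadata["Name"].lower()):
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```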

More about the various aspects of reproducibility can be read on the Yale University web page Curating for Reproducibility.

Version control

Datasets, just like files, often go through a number of versions. Datasets may also be worked on by more than one person and over a long period of time. To ensure the integrity of the data, it is important to keep track of and document the changes made. Version control is the way to track changes and revisions of a dataset, and it is essential should your research involve more than one person. Version control prevents errors, increases data quality and work efficiency within a project, and facilitates reuse of the data at a later stage. Keeping track of changes can be done by applying a file naming convention, a version control table, or an automated version control system (e.g., Git, Subversion); a minimal Git example is sketched below. More information on version control can be found at Collect, organise, and store data, while Archive and preserve data provides guidance on which files and versions of files need to be kept and which may be disposed of.
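
The sketch below shows the automated approach with Git, one of the systems named above, as commands run in a project folder. The file names are hypothetical, and note that large or sensitive data files may require other solutions than a plain Git repository.

```sh
# Put the project folder under version control and record a snapshot.
git init
git add clean_data.py processing_log.txt   # stage the files to track
git commit -m "Add data cleaning script and log"
git log                                    # review the history of changes
```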

Handling personal or sensitive data

Data containing personal or sensitive information has to be handled according to data protection, freedom of information, and archives legislation (see Collect, organise, and store data for advice on information security and data protection as well as SLU’s internal web page for guidance on how to protect data and personal data).

How such information may be made public depends on, among other things, the consent given in the informed consent form (information on and templates for informed consent can be found on SLU’s Integrity and Data Protection Function’s web page More on data protection). Should you neither be allowed to disseminate information that could imperil the confidentiality of research subjects or objects nor have obtained informed consent prior to dissemination, a separate version of the data needs to be created in which identifiable information is removed (anonymisation) or replaced (pseudonymisation). Personal or sensitive data may be identifiable directly or indirectly. An individual may be identified directly by name, address, telephone number, photograph, or some other unique personal characteristic, alone or in combination. Indirect identification of an individual, on the other hand, may occur when certain information is linked together with other, more general information, such as place of work, job title, salary, or postal code. You can read more about identifiers and related factors with regard to personal data on the web page of the UK’s Information Commissioner’s Office.

Anonymisation and pseudonymisation

Anonymisation, or de-identification, is the process of removing all information that may lead to an individual being identified. Pseudonymisation, on the other hand, means processing personal information in such a way that the data can no longer be linked to a specific data subject without the use of additional information. This often involves substituting identifiers (e.g., social security numbers) with other values in such a way that they can be matched back to the original identifiers by means of a code key. Such additional information (e.g., the key translating consecutive numbers back into social security numbers) must be kept separate from the data and stored safely with limited access; a minimal sketch of this approach follows below.
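
The sketch below illustrates pseudonymisation with Python and pandas (assumptions of this example): identifiers in a hypothetical 'personal_id' column are replaced by random codes, and the code key is written to a separate file that must be stored apart from the pseudonymised data, safely and with limited access.

```python
import secrets
import pandas as pd

df = pd.read_csv("participants.csv")  # contains a 'personal_id' column

# Build the code key: one random pseudonym per unique identifier.
key = {pid: secrets.token_hex(8) for pid in df["personal_id"].unique()}

# Replace the identifiers in the dataset with their pseudonyms.
df["personal_id"] = df["personal_id"].map(key)
df.to_csv("participants_pseudonymised.csv", index=False)

# Write the code key to a separate file; it must be stored away from
# the pseudonymised data, safely and with limited access.
pd.DataFrame(list(key.items()),
             columns=["personal_id", "pseudonym"]).to_csv(
    "code_key_store_separately.csv", index=False)
```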

Should you need to be able to identify individuals or objects during and after the project, for follow-up purposes for instance, the data should be pseudonymised rather than anonymised. There may also be other situations in which anonymisation is not possible or appropriate. In such cases, it may be possible (depending on consents, etc.) to publish a pseudonymised version of a dataset and preserve the original dataset with restricted access, for release on request if permitted by the Public Access to Information and Secrecy Act (SFS 2009:400).

Sensitive data can also refer to non-human data, such as geospatial data that reveals the location of rare or endangered species or other confidential objects. In such cases, there are methods available to mask or distort the data in order to de-sensitise it as required before making it available. For example, geographical coordinates can be replaced with approximate values, making it impossible to identify the location of sensitive subjects or objects without access to the original data.
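
As a minimal illustration, the sketch below (again Python with pandas, and hypothetical file and column names) rounds coordinates to a coarser precision; the appropriate degree of masking depends on the sensitivity of the objects and should be decided case by case.

```python
import pandas as pd

df = pd.read_csv("observations.csv")  # 'latitude'/'longitude' columns

# Rounding to one decimal degree locates each point only to within
# roughly 10 km, hiding the exact site of, e.g., a rare species.
df["latitude"] = df["latitude"].round(1)
df["longitude"] = df["longitude"].round(1)

df.to_csv("observations_masked.csv", index=False)
```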

Information about techniques for anonymisation and pseudonymisation can be found on the UK Data Service’s web page.
