On this page you can read about the difference between raw and processed data, about how version control can help you avoid errors and increase your data’s quality, and about how personal data and sensitive information should be handled in research and environmental monitoring and assessment projects.
Raw and processed data
Data usually has to be processed (i.e. prepared) prior to analysis. Data processing may include actions such as data verification, organisation, cleaning, transformation and subsetting.
It is important to highlight the difference between raw and processed data. Raw data is the original source data, i.e. data that is not yet processed. Once you start actively processing the data, it is called processed (or active) data.
To protect the integrity and security of the data and to avoid data loss, it is crucial to make a copy of the raw data (preferably as a read-only version) and store it safely. The copy of the raw data must not be modified and should be kept separate from the processed data.
Once you have finished processing and analysing the data, make sure you also save a copy of your processed data in a format that can be archived and preserved. Some computer programs used for processing or analysis use proprietary file formats (which can only be read by the specific programs that created the files), but it may be possible to also export and save versions of the data in a format suitable for preservation and reuse.
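As an illustration, the sketch below shows one possible way to make a read-only copy of a raw data file in Python. The function names (sha256_of, preserve_raw_copy) and the folder layout are assumptions made for this example, not part of any prescribed workflow. Note that a read-only flag only guards against accidental edits; it does not replace backups or access control.

```python
import hashlib
import shutil
import stat
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve_raw_copy(source: Path, raw_dir: Path) -> Path:
    """Copy a raw data file into raw_dir and mark the copy read-only.

    Hypothetical helper: raw_dir is assumed to be a folder kept
    separate from the processed data.
    """
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / source.name
    if target.exists():
        # Never overwrite an existing raw copy.
        raise FileExistsError(f"refusing to overwrite {target}")
    shutil.copy2(source, target)
    # Confirm that the copy is byte-for-byte identical to the source.
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"checksum mismatch after copying {source}")
    # Remove write permission for owner, group and others.
    target.chmod(stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH)
    return target
```

Recording the checksum alongside the copy would also let you check later that the raw file has not changed.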
Documenting data (creating metadata)
For further information on documenting data and creating metadata, see Collect, organise, and store data.
Version control
Datasets and data files often go through a number of versions. Also, data may be processed by more than one person and over a longer period of time. To ensure the integrity of the data, it is important to keep track of and document the changes that are made. Version control is a way to track changes and revisions of a dataset.
Version control helps prevent errors, increases data quality and efficiency within a project, and facilitates reuse of the data at a later stage.
A good idea – especially when working in collaboration with others – is to document your file management system (i.e., folder structure, file and folder naming convention, file versioning, and choice of file format) in a supporting ReadMe file. It is recommended to place such a ReadMe file within the top-level folder of your project, where it can be found easily by everyone involved. Note: make sure to update the ReadMe file upon changes to the file management system.
There are a number of ways that version control can be managed, e.g. using file naming, version control tables or version control systems.
With file naming, you can use a revision numbering system (e.g., v01 for the first version, v02 for the second, etc.) to track changes to a file (example: 2020-07-28_ProjA_DMP_v01.docx, 2020-08-28_ProjA_DMP_v02.docx) and/or add initials (e.g., JD for John Doe) to identify who has made the changes (example: 2020-08-28_ProjA_DMP_v02_JD.docx).
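If file names are generated by a script, the naming pattern from the examples above can be built consistently. The following is a minimal sketch; the function names and the exact pattern (date, project, label, version, optional initials) are assumptions based on those examples rather than a fixed standard.

```python
import re
from datetime import date


def versioned_name(day: date, project: str, label: str, version: int,
                   extension: str, initials: str = "") -> str:
    """Build a file name such as 2020-07-28_ProjA_DMP_v01.docx."""
    parts = [day.isoformat(), project, label, f"v{version:02d}"]
    if initials:
        # Optional initials identify who made the change.
        parts.append(initials)
    return "_".join(parts) + extension


def parse_version(filename: str) -> int:
    """Extract the version number from a name built by versioned_name."""
    match = re.search(r"_v(\d+)(?:_[A-Za-z]+)?\.[^.]+$", filename)
    if match is None:
        raise ValueError(f"no version number found in {filename!r}")
    return int(match.group(1))
```

For example, versioned_name(date(2020, 8, 28), "ProjA", "DMP", 2, ".docx", "JD") returns 2020-08-28_ProjA_DMP_v02_JD.docx, and parse_version applied to that name returns 2.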
More detailed information about file versioning can be found in:
Version control tables
Version control tables are included within the document itself and contain information such as version, date of change, name of the person who made the change, and the nature and purpose of the change.
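For plain-text documents or ReadMe files, such a table could also be generated from structured entries. The sketch below is only one possible layout; the class name, field names and column headings are hypothetical and chosen to mirror the information listed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class VersionEntry:
    """One row of a version control table (hypothetical field names)."""
    version: str
    changed_on: date
    changed_by: str
    change: str


def render_table(entries: list[VersionEntry]) -> str:
    """Render the entries as an aligned plain-text table."""
    header = ("Version", "Date", "Changed by", "Nature and purpose of change")
    rows = [header] + [
        (e.version, e.changed_on.isoformat(), e.changed_by, e.change)
        for e in entries
    ]
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(header))]
    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)
```

Each change to the document then corresponds to one new VersionEntry, and the rendered table can be pasted at the top of the document.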
Version control systems
Handling personal and sensitive data
Data containing personal or sensitive information must be handled according to data protection, freedom of information, and archives legislation.
It may be possible to connect personal or sensitive data to a specific individual directly or indirectly. An individual may be identified directly by their name, address, telephone number, photograph, voice, or some other unique personal characteristic, found alone or in combination. Indirect identification of an individual may occur when information is combined. For example, knowing someone’s age and place of work, or height and postal code, may make it possible to identify a specific individual, even if their name is not revealed. One way to make sure personal or sensitive information is not revealed accidentally is to anonymise or pseudonymise the data by removing or replacing the identifying features (more information below).
Sensitive data can also refer to non-human data, such as geospatial data that reveals the location of protected species or other confidential objects. In such cases, there are methods available to mask or distort the data in order to de-sensitise it as required before making it available. As an example, approximations of data values for geographical coordinates could be used, rendering identification of the location of sensitive subjects/objects impossible without access to the original data.
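One simple way to approximate coordinates is to snap each position to the centre of a coarse grid cell, so that all observations within the same cell share one published position. The sketch below assumes coordinates in decimal degrees and a hypothetical cell size of 0.1 degrees. It illustrates the general idea only; the appropriate level of coarsening depends on the data and on applicable guidelines.

```python
import math


def coarsen_coordinate(value: float, cell_size: float) -> float:
    """Snap a coordinate to the centre of its grid cell."""
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    return (math.floor(value / cell_size) + 0.5) * cell_size


def mask_location(latitude: float, longitude: float,
                  cell_size: float = 0.1) -> tuple[float, float]:
    """Return an approximate position; the exact one cannot be recovered from it."""
    return (
        coarsen_coordinate(latitude, cell_size),
        coarsen_coordinate(longitude, cell_size),
    )
```

The original coordinates should be kept separately with restricted access, since the masked values cannot be converted back.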
Anonymisation and pseudonymisation
One way to make sure personal or sensitive information is not revealed accidentally is to anonymise or pseudonymise the data by removing or replacing the identifying features.
Anonymisation, or de-identification, is the process of removing all information that may lead to an individual being identified. If you need to be able to identify individuals or objects during and after your project, for follow-up purposes for instance, data should be pseudonymised rather than anonymised. Pseudonymisation is when personal information is processed so that the data can no longer be linked to a specific data subject without the use of additional information. This often involves substituting identifiers (e.g., social security number) with other values in such a way that they can be matched back to the identifiers by means of a code key. Such additional information or code lists must be kept separately and stored safely with controlled and limited access.
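The substitution step can be sketched as follows. This is a minimal illustration, assuming records held as Python dictionaries with a single direct identifier field; the function name, the "P-" code prefix and the field names are hypothetical. It replaces direct identifiers only, so indirect identifiers (such as age combined with place of work) still need separate attention.

```python
import secrets


def pseudonymise(records: list[dict], id_field: str) -> tuple[list[dict], dict[str, str]]:
    """Replace the identifier in id_field with a random code.

    Returns the pseudonymised records and the code key, which maps
    each original identifier to its code.
    """
    code_key: dict[str, str] = {}
    used_codes: set[str] = set()
    pseudonymised = []
    for record in records:
        identifier = record[id_field]
        if identifier not in code_key:
            # Random codes reveal nothing about the original identifier.
            code = "P-" + secrets.token_hex(4)
            while code in used_codes:
                code = "P-" + secrets.token_hex(4)
            used_codes.add(code)
            code_key[identifier] = code
        cleaned = dict(record)  # leave the input record untouched
        cleaned[id_field] = code_key[identifier]
        pseudonymised.append(cleaned)
    return pseudonymised, code_key
```

The returned code key is the additional information referred to above: it should be stored apart from the pseudonymised records, with controlled and limited access.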