5. Data Documentation
Learning objectives
When you have completed this module, you will be able to:
- Understand what documentation is and why it is important.
- Describe what metadata are.
- Reflect on how to document your project so that it can be replicated by others.
____________________________________________________________
Why is good data documentation important?
As we learned in lesson 2, ‘Research data - types, formats and sizes’, research data come in various types and sizes, from small qualitative datasets obtained from a few interviews and presented in a single text file, to very large datasets with quantitative gene sequence data stored in databases. What these different datasets have in common is that they must be well described with contextual information. Without contextual information, others, including your future self, will not be able to understand what the data describe, when and where the data were created, how the data were collected and for what purpose. In other words, without contextual information, data are meaningless.
Having a good strategy for documenting your project and your data will ensure that you still understand your data in the future. It will enable you to return to your methods of collecting data, review the processing that you did or revisit other important pieces of information if you need to. You may be contacted by fellow students with questions about your project, or by a supervisor who wants to know whether you have followed a specific set of guidelines. The reviewer of your thesis or article may also have questions about your statistics or the variable labels in your dataset. Having good documentation will allow you to provide such evidence of your work and insights into your project, and it will allow your data to be reused in new projects, by yourself or by others.
____________________________________________________________
How and what to document
What documentation you should have in your project will depend on the type of research conducted and the type of data collected.
Examples of documentation
It will be up to you to determine what documentation is relevant for your type of project. The rule of thumb is that you should, as a minimum, have all the documents necessary for someone to replicate your project. These documents should be described in such a way that others understand them, so use clear language, avoid non-standard abbreviations, stick to terminology commonly used in your research field, give your documents logical file names, make sure to add dates, etc.
Documentation should be stored in a place where you (and possibly others) can find it again in the future. As much as possible, keep your project documentation stored in the same location as your data, with clear references to the datasets it refers to. Please note, however, that documentation and data sometimes cannot be kept in the same place, especially when documentation can lead to the identification of human participants in your project.
Lastly, a data management plan (DMP) can be used as a master document to keep track of all the different research-related documents, provided that you refer to all your documents in the DMP.
Metadata
‘Data documentation’ is a broad term that encompasses all sorts of information and materials that describe and explain both the data and the research project in which data are collected. Metadata are an important part of data documentation. Metadata are data about data, or in other words, information that describes, explains, locates or otherwise provides context for data.
Let’s look at an example of metadata in the context of the music service Spotify:
As in the Spotify example, the primary purpose of metadata in research is to facilitate the organization, discovery, and management of research data. Metadata provide essential details, such as attributes, that help people understand and use the data effectively.
We can distinguish between various types of metadata, including:
- Administrative metadata: metadata that provide information about the origin and source of the data and are relevant for managing the data. Examples are name of the project leader, funding provider, project period, contact information, project information, data collection dates, data access restrictions.
- Structural metadata: metadata that provide information about the organisation, relationships and internal structure of the data. Examples are file format, unit of measurement, sample size, data set version, categories and variables.
- Descriptive metadata: metadata that provide information about the content, context and characteristics that allow others to discover and identify data. Examples are the title of the dataset, the creator of the data, keywords describing the data, abstract and link to the data.
Metadata standards
When describing your research data, stick to common practices in your research discipline as much as possible. This will help others (including your future self) to understand the data and possibly reuse them in new projects. Metadata standards are guides to metadata descriptions that are agreed upon in your research discipline. They can, for example, specify what metadata to include in your data description. For example, when you adhere to the metadata standard Dublin Core to describe a digital object such as a photo, it means that you will, as a minimum, include information about the creator, format and date (which are 3 of the 15 metadata elements in Dublin Core). When you conduct surveys or observations in social, behavioural, economic and health studies, you could choose to adhere to the Data Documentation Initiative (DDI), for example, to standardize the generation of your code book(s). Read more about how to apply metadata standards in this useful guide from KU Leuven. Talk with your supervisor about whether it would make sense for you to use a metadata standard to describe your data.
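To make the Dublin Core example concrete, here is a minimal sketch in Python of a metadata record for a photo, together with a check that the three elements named above (creator, format and date) are filled in. The record values and the helper name `missing_elements` are hypothetical and invented for illustration only; they are not part of Dublin Core or of any official tool.

```python
# Minimal sketch: a metadata record using a few Dublin Core element names.
# All values are invented for illustration.
REQUIRED = ("creator", "format", "date")  # the minimum named in the text above


def missing_elements(record, required=REQUIRED):
    """Return the required elements that are absent or empty in the record."""
    return [name for name in required if not record.get(name)]


photo = {
    "title": "Field site at low tide",
    "creator": "Jane Doe",
    "format": "image/tiff",
    "date": "2024-03-11",
}

print(missing_elements(photo))                  # → []
print(missing_elements({"title": "Untitled"}))  # → ['creator', 'format', 'date']
```

A check like this could be run before sharing a dataset to catch records that lack the agreed minimum.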
ReadMe files
Metadata can be generated automatically. When you, for example, fill in an online form to upload data to a data repository (see lesson 7 ‘sharing data openly’), the data repository will ensure that the information in the form is presented as metadata in the data record. When you use equipment to generate data (for example, a video camera or laboratory instrument), the equipment will automatically add metadata such as date and time to your data file.
However, you can also manually add metadata, for example by using a supplementary file that you keep stored along with the dataset. One such file could be a ReadMe file. A ReadMe file is a simple text document (often named ReadMe.txt, or ReadMe.md) that is associated with a dataset, software project, or any collection of files. The purpose of a ReadMe file is to provide essential information about the contents, usage, and context of the data or project. It serves as a quick reference guide for your future self and for others, helping in understanding and navigating the dataset or software.
It is a good idea to produce a ReadMe file for all datasets you create, and to keep it stored with the dataset it refers to. You can use a ReadMe template, such as this one. Alternatively, you could make your own ReadMe file. If you make your own, consider including the following elements:
- General project information: title of the study, people involved and their roles.
- Methodological information: methods of data collection and analysis, instrument calibrations etc. When describing how the data were collected, add information about eligibility criteria and selection criteria. Name and refer to any instruments, tools, theories, methodological frameworks and materials or tasks you gave to the participants that were part of the data collection. It is important to document how you designed your data collection, since the construction of any instruments or tools you used to collect the data is related to any errors that might occur. Note who collected the data, and when and where the data were collected.
- Process information: Add a very brief description of how you processed and analysed the data. Identify the format of the data. Name the version of any software you used and how you accessed the software, for example through the university library. The access you have to a piece of software can determine which analyses you can conduct and how much data can be analysed. Include how you edited and cleaned the data and any statistics you used. Describe the coding framework, how open-ended questions or observations were coded and any quality assessment procedures applied.
- Data specific information: List the variable names and their definitions, units of measurement, column headings in an Excel sheet, setting and resolution information about images, etc. For example, describe how you have defined and coded the “gender” of the participants in your project, list the value labels you use to differentiate gender properties such as male, female, and gender fluid, and finally list the response codes for data processing: 1 = male, 2 = female, 3 = gender fluid. Depending on the sensitivity of the data, you might not even be able to share a metadata description of the data, as even this description may expose your participants.
- Information on how to reuse and share your data: State the rights others have to use and share the data. If you have given your data a usage licence, make sure to state it. Learn more about rights and responsibilities in lesson 3 and how to responsibly share your data in lesson 7. Keep documentation of informed consent with your copy of the data.
- Preservation information: State the format of the data, such as a spreadsheet in CSV, a text file, a PDF or images in TIFF, and where, how and for how long the data are stored.
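The elements listed above can be turned into a simple ReadMe skeleton. The following is a minimal sketch in Python, not a replacement for the course template: the function name `build_readme`, the section headings and every value passed in are hypothetical placeholders chosen for illustration. Only the gender response codes are taken from the example above.

```python
# Minimal sketch: section names follow the list above; every value passed in
# below is a made-up placeholder, not real project data.
def build_readme(title, people, methods, processing, variables, reuse, preservation):
    """Return ReadMe text with one block per element in the list above."""
    lines = [
        "GENERAL PROJECT INFORMATION",
        f"Title: {title}",
        f"People and roles: {'; '.join(people)}",
        "",
        "METHODOLOGICAL INFORMATION",
        methods,
        "",
        "PROCESS INFORMATION",
        processing,
        "",
        "DATA SPECIFIC INFORMATION",
    ]
    # One line per variable: name, then its definition and response codes.
    lines += [f"{name}: {definition}" for name, definition in variables.items()]
    lines += ["", "REUSE AND SHARING", reuse, "", "PRESERVATION INFORMATION", preservation]
    return "\n".join(lines) + "\n"


readme_text = build_readme(
    title="Example thesis project",
    people=["A. Student (data collection, analysis)", "B. Supervisor (supervision)"],
    methods="Online survey, collected in March 2024.",
    processing="Data cleaned and analysed in a statistics package (state name and version).",
    variables={"gender": "1 = male, 2 = female, 3 = gender fluid"},
    reuse="State the usage licence here.",
    preservation="CSV file; state where, how and for how long it is stored.",
)
print(readme_text)
```

The printed text can be saved as ReadMe.txt in the same folder as the dataset it describes.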
___________________________________________________________
Version control
Version control is part of the documentation process too. Version control is a process of recording and organising changes to documents, papers, books, catalogues, computer programmes, code and websites and much more. Working with versioning means that old versions of files are saved and stored, so nothing is lost and changes can be reverted back to a previous version if need be. This is especially important when multiple people work on the same files. Version control helps you go back in time to see exactly who wrote what on which day and at which time.
Here are four approaches to versioning:
Approach #1: Use file naming conventions
A file naming convention is a systematic and standardized approach to naming files consistently. So instead of naming files “finallyfinalversion.doc”, design an approach for how you will name your files, such as “date_keywords_author initials_version.format”. An example is “20240311_effect smokeonsleep_SDB_V3.xls”, where the date is recorded as YearMonthDay and you save a new version every time you make (big) changes to the document.
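If you create many files, a convention like the one above can be applied with a small helper so that names are always built the same way. This is a minimal sketch in Python; the function name `versioned_name` and its parameters are hypothetical and chosen for illustration only.

```python
from datetime import date


def versioned_name(day, keywords, initials, version, extension):
    """Build 'date_keywords_initials_Vn.extension' with the date as YearMonthDay."""
    return f"{day:%Y%m%d}_{keywords}_{initials}_V{version}.{extension}"


print(versioned_name(date(2024, 3, 11), "effect smokeonsleep", "SDB", 3, "xls"))
# → 20240311_effect smokeonsleep_SDB_V3.xls
```

Writing the date as YearMonthDay means that an alphabetical sort of the file names is also a chronological sort.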
Approach #2: Add version control tables
Include a version control table somewhere in your document (often on the document’s cover page). This table details the various versions of the document, what changes were made, who made the changes and when.
Version | Author | Purpose/Change | Date
1 | Susanne | Original document | 2024.01.03
1.1 | Susanne | Table 1.3 added to the document | 2024.01.13
1.2 | Katrine | Methods description edited | 2024.01.30
2 | Susanne | New discussion section added | 2024.02.15
Approach #3: Use data storage and file sharing software with built-in version control
In some cases, file sharing and collaboration software can track file changes. At UCPH, Microsoft OneDrive is one of the recommended tools for storing and sharing data files (see lesson 6 ‘data storage and security’). Microsoft OneDrive offers version control, automatically tracking changes made to files and creating a version history. Users can view and restore previous versions of files.
Approach #4: Use a version control tool such as Git in combination with a Git repository manager
Git is free version control software that is used via the command prompt in Windows, or the terminal on Mac/Linux. Git allows multiple people to work on files simultaneously, in particular files containing source code. Git allows users to push and pull information about code changes to and from central code repositories such as GitHub, GitLab or Bitbucket. These repositories serve as a centralized hub for storing and organizing projects, and for keeping track of any changes made to the code. You can find a Git tutorial here.
____________________________________________________________
Documentation in practice
Now let’s look at some examples of documentation of data at UCPH.
Master’s thesis about youth quotas in parliaments and political parties
Malthe Rugberg Andersen, Faculty of Social Sciences, talks about the importance of documentation when working with survey data.
Master’s thesis about reducing risk factors in obese people through weight loss and/or medicine
Rasmus Michael Sandsdal, Faculty of Health and Medical Sciences, talks about documentation and file-naming conventions in his thesis project which was part of a larger research project.
Master’s thesis on modelling of grain quality
In his Master’s project, Jakob Riber Rasmussen, Faculty of Science, reused existing data produced by a commercial company, FOSS. Jakob talks about the importance of metadata when working with data produced by someone else.
____________________________________________________________
Continue with your DMP
Please continue working on your project’s data management plan (DMP) by answering the questions in section 5, Data documentation:
5.a In short, describe your methods for collecting and processing the data/material. If a detailed methods description already exists, you can refer to it instead.
5.b Describe how you will organise and structure your digital data/material, how you will name your files and keep track of different file versions.
5.c Describe how you will document your project and the data. What information will you record about your data and material to ensure that your project/data/material is understandable to others?
If you haven't begun filling out your DMP yet, you can find the DMP template here: UCPH DMP Template for Students
Remember to discuss the data management plan with your supervisor at the start of your project. Keep the DMP stored along with your data.
____________________________________________________________
Practical tips and resources for data documentation
- Use a data management plan to describe your approach for data documentation and list the various documents you produce in your project, as well as their storage location. Update your DMP every time new documentation is produced. In this way your DMP will function as a manual for your project. Discuss with your supervisor whether they should have access to the DMP. You can use the DMP template developed for this course: UCPH DMP Template for Students
- Store your data file(s) along with ReadMe file(s) that explain the dataset(s). You can use the ReadMe template that we have developed for this course (adapted from material by Cornell University).
- If necessary, you can ask the UCPH (KUB) library for support in designing a systematic approach to organising your data, ensuring the reproducibility of your data and documenting your work. KUB Datalab helps students with data in R, Python, OpenRefine, Excel and more with regard to harvesting, cleaning, analysing and visualising data. Its primary focus is to support the responsible application of digital tools and methods in research, and to offer counsel on the strengths and weaknesses of these tools and methods. Contact KUB Datalab via their webpage or at kubdatalab@kb.dk.
- Look up data management related terms in the RDM Glossary.
____________________________________________________________
Learn more
Below, you will find external resources for optional further reading on the topics mentioned in this lesson.
- Documentation and metadata - CESSDA training. (n.d.). Retrieved November 14, 2024, from https://www.cessda.eu/Training/Training-Resources/Library/Data-Management-Expert-Guide/2.-Organise-Document/Documentation-and-metadata
- Finnish Social Science Data Archive (FSD). (n.d.). Data Description and metadata. Finnish Social Science Data Archive (FSD). Retrieved November 14, 2024, from https://www.fsd.tuni.fi/en/services/data-management-guidelines/data-description-and-metadata/
- RDM Support Desk, KU Leuven. (2022, March 22). FAQ RDM - Metadata and documentation. Research Data Management. Retrieved November 14, 2024, from https://www.kuleuven.be/rdm/en/FAQ/FAQ_Metadata_and_documentation#What%20is%20a%20README%20file
- Guidelines for creating a README file - 4tu.federation. (n.d.). Retrieved November 14, 2024, from https://data.4tu.nl/info/fileadmin/user_upload/Documenten/Guidelines_for_creating_a_README_file.pdf
- How to fair. FAIR. (n.d.). Retrieved November 14, 2024, from https://www.howtofair.dk/
____________________________________________________________
Published in 2024