Describe your data so they can be understood – about reusing data
Time
The lesson is expected to take about 20 minutes to complete. There are also reflection exercises, which are very good to do together with your fellow students.
About the lesson ‘Describe your data so they can be understood – about reusing data’
In this lesson, we will introduce you to the many concepts, considerations and processes associated with documentation and metadata. You will be given an introduction to what documentation can involve, and tips on how to start writing documentation.
The lesson will mention several concepts that may be new to you. You can see the definitions of these in the glossary.
Learning outcomes
When you have completed the lesson, we expect that you:
- can explain why the documentation of data and data management is important.
- can provide examples of what this type of documentation can consist of.
Source
The lesson was produced by the University of Copenhagen as part of its learning resources for digital literacy, 2023.
Course directors:
- Lorna Wildgaard (PhD), Specialist Consultant, Copenhagen University Library (KUB), Research Support Services
- Asger Væring Larsen, Specialist Consultant, Copenhagen University Library (KUB), Research Support Services
__________________________________________________________
Why is it important to document data?
What kind of information is required to allow others – or even you in 5-10 years – to understand and reuse data which you have collected and processed?
And what kind of documentation do you need if you want to use data which others have collected, processed and shared?
Enabling others to repeat one’s experiments or studies is one of the most important ways of building confidence in the scientific value of the results. It means that others can find out how you have collected your data, and perhaps try to repeat the study or the analysis, validate the work, or simply ask better questions about the results. Good data documentation is therefore crucial if researchers and other professionals engaged in academic work are to build on each other’s results and thus better address the challenges they are working with.
Data documentation must therefore contain information about what was studied, how, why, when and where the study was conducted and who performed the work.
As a UCPH student, you will be in a better position to do your assignments and better prepared for your exams if you take data documentation seriously. You will also become part of a strong scientific tradition in which transparency and collaboration are key elements.
__________________________________________________________
Data documentation
This video from Ghent University in Belgium explains why data documentation is important, the difference between documentation and metadata, and the different types of documentation of academic data.
The video briefly mentions the term ‘FAIR data’ (0:47). The term refers to a number of good data documentation principles that make academic data ‘findable’, ‘accessible’, ‘interoperable’ and ‘reusable’. If you are curious about this, you can read more at howtofair.dk, but it is not necessary to do so.
__________________________________________________________
You learn about data documentation by assessing other people’s data
Much of the data you will be working with during your studies (especially at the beginning) will probably be data which are given to you by your teacher or which you download online in connection with teaching activities. They are so-called ‘secondary data’ which others have produced, for example your teachers, other researchers, public-sector bodies or organisations.
opendata.dk is a good place to start if you want to find openly accessible datasets.
Some of the secondary datasets which you will end up working with will probably be documented in advance. By reading other people’s data documentation to assess the validity and quality of their data, you learn a lot about what constitutes good – and poor – documentation. You can then draw on this experience when creating your own data and documenting the purpose and the process.
__________________________________________________________
Get off to a good start with documentation
In documenting your data, you describe how you collect your data, the methods and theories you use, and how you use them, as well as how you conduct your analyses. The purpose is to enable others to understand what you did and how. The documentation also includes a description of where and how you have stored your data, and how others can access the data. When talking about ‘data’ in this lesson, we mean your empirical data or evidence, which may be in the form of numbers, interview texts, notes from observations, images etc.
A useful tip: Write your data documentation while working with your data, not at the end when you are putting the finishing touches to your written assignment. We humans are very quick to forget the small but important details!
The purpose of documenting your data is to provide sufficient information to ensure that your fellow students, teachers or others who are familiar with your field, but not necessarily with your project or assignment, are able to understand, interpret and reuse your data.
Together, the documentation associated with your project or assignment should answer a number of important questions such as:
- Which data or what evidence are you collecting?
- In what context are the data being collected?
- How are you generating or collecting the data?
- What is the form of the data (for example ‘interview transcript’)?
- How are the data formatted, structured and organised?
- How are you processing and analysing the data?
- On what ethical or legal terms (for example confidentiality and copyright) can the data be accessed/used/reused?
See examples of specific information which you can include in your data documentation
Depending on the nature of your investigations and the background context, you may need to include some of the following information:
- Details of the equipment used, such as brand and model, the equipment settings and information about how it was calibrated.
- Details of the methods or theories used, such as interpretations and models, and information about your perspective and how you have applied the theory or model.
- The wording of questionnaires, interview guides, subject guidelines or discourse analysis forms.
- Details of who collected the data and when.
- Key features of the method, such as sampling techniques, whether the experiment was blinded, how participants were identified and how test groups were subdivided.
- Legal and ethical agreements concerning the data, such as consent forms, data licences and approvals.
- References to any secondary data used.
- Details about file formats.
- A glossary of the column names and abbreviations used, defining, for example, the variables in your dataset, the measurements or observations resulting in a given column and the units of measurement used.
- Methods for handling missing data.
- Which statistical analyses you have done.
- How you have described the variables in your questionnaire.
- The workflow you are using to process your data, including the use of statistical tests or outlier removal.
- Details of the software used to generate or process the data, including version number and platform.
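The last item in the list, recording the software version and platform, is easy to automate. The following is a minimal sketch in Python, assuming your processing is done in Python; the function name `describe_environment` and the example package name are illustrative and not part of the lesson material.

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages):
    """Return text lines describing the platform, Python version and package versions."""
    lines = [
        f"Platform: {platform.platform()}",
        f"Python: {sys.version.split()[0]}",
    ]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            version = "not installed"
        lines.append(f"{name}: {version}")
    return lines


if __name__ == "__main__":
    # "pip" is only an example; list the packages you actually used.
    print("\n".join(describe_environment(["pip"])))
```

You could paste the printed lines into your documentation each time you rerun an analysis, so that the recorded versions match the results.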
__________________________________________________________
Formats for data documentation
There are different ways in which you can document your data depending on the context in which the data are collected:
- README file: A structured text file in which you describe your dataset and how it was collected and analysed. Read more about README files below.
- Electronic Lab Notebook: An Electronic Lab Notebook (ELN) is software that mimics the traditional paper lab notebook used by many researchers in their daily work.
- Logbook: A logbook is used to record your observations, interpretations and empirical data.
- Codebook: A codebook is used to describe the definition of the variables you use, their structural relationship, their units of measurement, how you note deficiencies in the dataset, whether you have merged variables into new categories, etc.
- Data file: Some file formats can record information in addition to the data content. Image files register, for example, time, date, size, exposure time etc.
- Separate metadata file: Some academic disciplines have developed special file formats for registering supportive information. Standard terminologies are used in the data description in order to make the data description machine-readable and thereby ‘findable’ on the internet. You will probably only need these if you are working with data in a real research project.
You can often choose one or two of the above formats depending on how many types of data you have. A README file may be enough to document small amounts of data.
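To make the codebook format more concrete, here is a small sketch in Python. The variables, the missing-value code `-9` and the column layout are invented for illustration; your discipline may prescribe a different layout. The sketch writes the codebook as CSV text and checks whether every column in a dataset has a codebook entry.

```python
import csv
import io

# Hypothetical variables from an imagined survey dataset.
CODEBOOK = [
    {"variable": "resp_id", "description": "Respondent identifier", "unit": "", "missing_code": ""},
    {"variable": "age", "description": "Age at time of interview", "unit": "years", "missing_code": "-9"},
    {"variable": "height", "description": "Self-reported height", "unit": "cm", "missing_code": "-9"},
]

FIELDS = ["variable", "description", "unit", "missing_code"]


def codebook_to_csv(entries):
    """Serialise codebook entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def undocumented_columns(data_columns, entries):
    """Return the dataset columns that have no codebook entry."""
    documented = {entry["variable"] for entry in entries}
    return [column for column in data_columns if column not in documented]
```

Calling `undocumented_columns` with the column names of your data file gives a quick reminder of which variables still need a description.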
__________________________________________________________
Metadata
Metadata describe the content, context and provenance of a dataset in a standardised and structured manner. Usually, the purpose, origin, characteristics, geographical location, authorship, access and terms and conditions for using a dataset are described. You have already come across metadata, for example in the form of descriptions of books in a library catalogue, product descriptions on Amazon and descriptions of you on your Absalon profile. Spotify music tracks come with lots of metadata: title, artist, year of release, length in minutes etc.
Metadata can be used to make the data visible and easier to find for both people and computers on the internet. The machine-readability of metadata is crucial for being able to find research datasets on the internet in general and in various index services such as opendata.dk. Good, machine-readable metadata are thus an important component of the FAIR principles, which aim to make research data easier to find and reuse.
It is not certain that the datasets you produce as a student will ever end up in a data repository – but some might. A data repository displays datasets in the same way that Spotify displays music so that the material is searchable, and so it is possible to read what it contains.
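As a rough illustration of what ‘machine-readable’ can mean in practice, the sketch below stores a metadata record as JSON using Python. The field names and all values are assumptions made for this example; they do not follow any particular metadata standard, and a real repository will define its own required fields.

```python
import json

# All values are invented placeholders for illustration.
metadata_record = {
    "title": "Example interview study",
    "purpose": "Course assignment on study habits",
    "creators": ["Student A", "Student B"],
    "date_collected": "2023-03-01",
    "geographical_location": "Copenhagen, Denmark",
    "access": "Available on request",
    "terms_of_use": "For teaching purposes only",
}

# Hypothetical minimum set of fields; adjust to your own needs.
REQUIRED_FIELDS = ["title", "creators", "access", "terms_of_use"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in required if not record.get(field)]


def to_json(record):
    """Serialise the record as indented JSON text that a program can parse."""
    return json.dumps(record, indent=2, ensure_ascii=False)
```

Because the record is structured, a program can check it for completeness or search it, which is not possible with a free-text description alone.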
__________________________________________________________
How do you write a README file?
A README file is a plain text file called ‘README’ to encourage others to read the file before looking at your data. Even though a README is free text, the file should be structured in sections to help the reader. The following table summarises suggestions for what you can include. What you need to describe depends on your project or assignment, the dataset’s characteristics and format, and the context in which the data were collected.
We have prepared an example of a README file, which you can download and use for describing your data. The file is very comprehensive and is meant as inspiration.
Section | Suggested content
Citation information | Citation information is the information that is necessary to allow others to cite your dataset correctly.
Purpose |
Method/theory |
Secondary data |
Workflow | Enter details about the steps you took to process the data, as well as all the necessary settings for the software.
Output |
Your files |
Your dataset |
Access to your data | Please provide a brief statement about the conditions on which others may use the dataset.
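If you prefer to start from an empty file, the section names in the table can be turned into a README skeleton. This is only a sketch in Python: the function names, the underlined heading style, the ‘TODO’ placeholders and the file name ‘README.txt’ are assumptions made for this example and are not taken from the downloadable example file.

```python
from pathlib import Path

# Section names taken from the table of suggested README content.
SECTIONS = [
    "Citation information",
    "Purpose",
    "Method/theory",
    "Secondary data",
    "Workflow",
    "Output",
    "Your files",
    "Your dataset",
    "Access to your data",
]


def readme_skeleton(title, sections=SECTIONS):
    """Build README text with one underlined heading and a TODO per section."""
    lines = [title, "=" * len(title), ""]
    for section in sections:
        lines += [section, "-" * len(section), "TODO", ""]
    return "\n".join(lines)


def write_readme(folder, title):
    """Write the skeleton to README.txt in the given folder and return its path."""
    path = Path(folder) / "README.txt"
    path.write_text(readme_skeleton(title), encoding="utf-8")
    return path
```

Saving the skeleton next to your data at the start of a project makes it easier to fill in each section as you go.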
________________________________________________________
Take home messages
As mentioned previously, the process of documenting your data should start at the beginning of your project and continue throughout. Adhering to this principle will make the documentation task easier, and make it less likely that you forget the details of each sub-process later on. By documenting your data, you will also ensure that you and others, including fellow students, teachers, supervisors and external examiners, will be able to interpret and evaluate your work.
What you include in your documentation depends on your project and the data types you collect and generate. The documentation can – as described above – comprise many different types of information about your project and the actual dataset. However, all forms of documentation include basic information which allows the data to be correctly interpreted and reused. Some academic disciplines may prefer one particular documentation format to another. You should find out which documentation formats are usually used within your academic discipline.
It is important that the documentation of your data is available together with your dataset. The documentation is crucial for being able to identify your dataset and for understanding which data you have collected or created, when and how you have done it, which variables and values are contained in the dataset as well as how one can access your data. Your documentation therefore helps to protect your copyright, and if you upload your data to a repository, the documentation will be used to track how your data are being reused.
You will find more information about project documentation and publishing data at ‘Research data management for students’.
________________________________________________________
Reflection exercise on metadata
You can get a clearer idea of what metadata are by taking a closer look at data sources you know from your daily life.
Investigate and reflect on the following questions:
- What metadata exist for music tracks on Spotify, and what can they be used for?
- What metadata exist on Zalando.dk, and what can they be used for?
- What metadata exist for images on Flickr, and what can they be used for?
- What metadata exist for books in the University of Copenhagen Library catalogue, and what can they be used for?
- Find a dataset at opendata.dk. What metadata exist for the dataset, and what can they be used for?
________________________________________________________
Reflection exercise on your own experience with data documentation
Reflect on the following:
- Think about one of your projects or assignments and the data you worked with. What forms of data documentation did you use, or could you have used? Why?
- Have you ever had trouble understanding or reusing other people’s data due to lack of documentation?
- Have you ever had trouble understanding your own data due to lack of documentation?
________________________________________________________
Further information
There are several resources where you can find out more about documenting data:
README
- The README template will help you structure your description of your dataset.
- Guidelines for creating a README - 4tu.federation. (n.d.).
- ICPSR’s Guide to social science data preparation and archiving (social science). See the section ‘Best Practice in Creating Metadata’ on documentation.
Data dictionaries & codebooks
- Data Ab Initio’s Data dictionaries
- McGill University Health Center’s Codebook cookbook: How to enter and document your data
- ICPSR’s What is a codebook?
Laboratory records
- Rice University’s Experimental Biosciences’ Guidelines for keeping a laboratory record
- Colin Purrington’s Maintaining a laboratory notebook
- University of Oregon’s Harms lab’s Data management in the lab
________________________________________________________