2. Research Data – types, format, and sizes
Learning objectives
When you have completed this lesson, you will be able to:
- Identify different types of data
- Understand how data type, file format, and file size can influence how you manage your data
- Explain which data types need additional protection and why this is important
____________________________________________________________
Data types
As mentioned in the previous lesson, it is good practice to develop a plan for data management at project start, as this will help you identify possible challenges in advance. A good starting point for determining the data management approach for your project is to identify and describe the research data you will work with. To describe the data in your project, you may want to group your data into categories. There are various ways in which you can do this. Below are three examples:
1. Quantitative vs. qualitative data
You could group your data into two overarching categories:
- Quantitative data are data that can be counted or compared on a numerical scale, such as measurements made by laboratory equipment, counts of the daily number of visitors to an exhibition, and survey data on income and spending.
- Qualitative data are non-numerical and describe qualities or characteristics. Examples are interview transcripts, responses to open-ended questionnaires, photographs or audio files.
2. Categorisation by collection method
You can also group your data based on how they are obtained.
- Observational data are data collected through the observation of an activity, such as sensor readings or observations of animal or human behaviour.
- Experimental data are data collected under controlled conditions, often by manipulating a variable in a study and measuring the outcome. Examples are plant growth data collected under various light treatments, or results obtained by varying a parameter in a search query.
- Interview data are data collected by asking questions to (groups of) individuals to gather quantitative or qualitative information, such as when studying cultural identity or when measuring user satisfaction with a particular service.
- Simulation data are data generated by computer models to simulate real conditions, such as economic or meteorological models.
- Derived data are data created by combining and processing existing data, such as through text mining of literature or data mining of datasets.
3. Categorisation by level of security required (data classification)
It is very important that you are able to identify whether there are ethical or legal reasons for you to be extra diligent in protecting your data against unauthorized access or loss of data. This may be required to protect the privacy of individuals, to protect animals, ecosystems, vulnerable populations, cultural heritage etc., or because there are contractual agreements preventing data disclosure. You can therefore group your data by the level of data protection they require. Here, we roughly divide data into two categories:
- Data where ethical or legal reasons for data protection exist:
Personal data are data that can directly or indirectly identify a person.
Confidential data are data other than personal data that should only be accessible to a limited number of people, and whose accidental or deliberate exposure can have considerable consequences. Examples are company data, data that have commercial potential, classified government data, and sensitive biological data.
- Data where no ethical or legal reasons for data protection exist:
You may work in a project where there are no ethical or legal reasons for data access protection and the data can be disclosed without negative consequences. Some examples are publicly available data sets, data presented in scientific publications, anonymized personal information, non-sensitive economic data and archaeological data from studies that do not involve sensitive cultural or heritage information.
By identifying the level of security required, you are halfway into conducting a so-called “data classification”. We will revisit and expand on this concept in lesson 6 ‘data storage and security’.
____________________________________________________________
Why is it important to assess data type?
Why should you thoroughly consider and describe the type of data that you will work with? Because the type of data influences how you will manage your data. Here are some examples to demonstrate this:
Example 1. Collection and documentation practice may differ between data types
If you work with quantitative data, you will use different methods to document your data than if you work with qualitative data. For example, you may record your data in Excel spreadsheets, rather than in text documents.
Example 2. Requirements for the use of data may differ between data types
If you decide to reuse existing data sets (‘derived data’) in your project rather than producing your own data, you will have to investigate whether the data provider, such as a company or registry, imposes rules on what you can and cannot do with the data.
Example 3. The measures needed to securely store data may differ between data types
If you work with personal data, you must store the data in a way that prevents unauthorized access. This requires additional security features, such as data encryption and storage on a specific drive for sensitive data. If you work with anonymized data, you may not need to encrypt your data, and your data can be stored on a normal university drive.
In other words, the way data are managed may differ from student to student and from project to project, depending on the type of data in the project. This means that your data management plan (DMP) may look completely different from the DMP of another student or project. This also means that you should make a new plan every time you start a new project.
____________________________________________________________
Other data descriptors: file format and data size
In addition to the data type, the format of your digital files and the size or volume of your data and physical material can also influence how you will work with your data.
File format
When you plan your project, you should consider which file format to store your data in. The file format might be determined by the standards in your research field, as well as the software you use to manage your data. If you use proprietary software (software owned by a company or individual), for example, you need to bear in mind that you and others who need access to the data (e.g. other project members) might not be able to open the files with different software. This becomes a problem if, for example, your project members don’t have the software, you lose access to it, or the software is no longer available. Therefore, where possible, it is best to opt for open file formats, such as the .csv format for tabular data and the .txt format for text files.
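As a minimal sketch of what choosing an open format can look like in practice, the Python example below writes a small table to a .csv file and reads it back, using only the standard library. The function names, file name, column names and values are hypothetical and serve only as illustration; your own field or software may call for a different approach.

```python
import csv
import tempfile
from pathlib import Path


def save_as_csv(rows, fieldnames, path):
    """Write a list of dictionaries to a UTF-8 encoded .csv file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_csv(path):
    """Read a .csv file back into a list of dictionaries (all values are text)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Hypothetical measurements, used only for illustration.
    measurements = [
        {"sample_id": "S1", "light_hours": 8, "growth_cm": 2.4},
        {"sample_id": "S2", "light_hours": 12, "growth_cm": 3.1},
    ]
    target = Path(tempfile.gettempdir()) / "plant_growth.csv"
    save_as_csv(measurements, ["sample_id", "light_hours", "growth_cm"], target)
    print(load_csv(target))
```

Note that a .csv file stores every value as text, so numbers have to be converted back after reading. It is worth documenting the columns and their units alongside the file.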
Data size
Data size can be expressed in numerous ways, such as the number of books or articles used in your text corpus, the number of samples used in your biological study, or the MB, GB or TB of digital data files you will work with. Estimating how much data you will produce will help you determine whether you have sufficient disk space to record your audio and video files, and to process and store your digital data, or whether you need to find another storage solution. Likewise, if you work with physical material, you will have to determine how much freezer or cabinet space you need.
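The estimate described above can be sketched in a few lines of Python. The example assumes a hypothetical project with 20 hours of interview recordings at 128 kbit/s; the function names and figures are illustrative only, and real file sizes depend on the recording format and settings you use.

```python
import shutil
from pathlib import Path


def audio_size_bytes(hours, bitrate_kbps):
    """Estimate the size of an audio recording from its duration and bitrate."""
    seconds = hours * 3600
    bits = seconds * bitrate_kbps * 1000  # kilobits per second -> bits
    return bits / 8  # 8 bits per byte


def folder_size_bytes(folder):
    """Sum the sizes of all files below a folder (for data you already have)."""
    return sum(p.stat().st_size for p in Path(folder).rglob("*") if p.is_file())


def to_gb(n_bytes):
    """Convert bytes to gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


if __name__ == "__main__":
    # Hypothetical project: 20 hours of interviews recorded at 128 kbit/s.
    needed = audio_size_bytes(hours=20, bitrate_kbps=128)
    free = shutil.disk_usage(Path.home()).free
    print(f"Estimated recordings: {to_gb(needed):.2f} GB")
    print(f"Free disk space:      {to_gb(free):.2f} GB")
    print("Enough space" if needed < free else "Find another storage solution")
```

Under these assumptions the recordings come to roughly 1.15 GB. Remember to leave room for processed copies and backups, which can multiply the raw estimate.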
____________________________________________________________
Data types in practice
Now let’s look at some examples of the different types of data at UCPH. You will meet seven students from different faculties. They will tell you about their projects and the data they work with.
Master’s thesis about youth quotas in parliaments and political parties
Malthe Rugberg Andersen, Faculty of Social Sciences
Data types: quantitative data, derived data, non-confidential data, surveys, publicly available data
Master’s thesis about reducing risk factors in obese people through weight loss and/or medicine
Rasmus Michael Sandsdal, Faculty of Health and Medical Sciences
Data types: quantitative data, derived data, observational data, personal data
PhD project about grammatical anomalies in Danish as a foreign language
Katrine Falcon Søby, Faculty of Humanities
Data types: qualitative data, quantitative data, personal data
Master’s thesis: Ensemble Modelling in Spectroscopy – Improving performance and estimating uncertainty
Jakob Riber Rasmussen, Faculty of Science
Data types: quantitative data, qualitative data, derived data, confidential data
Master’s thesis examining similarities between the Nordic sagas and Homer’s Odyssey and Iliad
Martin Herskind, Faculty of Humanities
Data types: derived data, qualitative data, quantitative data, publicly available data
Bachelor’s thesis examining the molecular mechanisms related to atrial fibrillation
Frida Birkedal Christiansen, Faculty of Health and Medical Sciences
Data types: quantitative data, experimental data, non-confidential data
Bachelor’s thesis examining the pronunciation of French language in Belgium
Maya Amalie Haven Træsborg, Faculty of Humanities
Data types: interview data, qualitative data, personal data
____________________________________________________________
Test yourself
Check whether you captured the main points in this lesson:
Quiz: Research data - types, formats and sizes
____________________________________________________________
Continue with your DMP
Please continue working on your project’s data management plan (DMP) that you started in lesson 1.
Describe the research data in your project by answering the questions in section 2 of your DMP:
Describe what data/material you will collect, observe, generate, create or reuse in the project.
2.a Indicate the type(s) of data for each data set in your project.
2.b Are there any personal data or confidential data in your project?
2.c Origin/source: where do the data/material come from?
2.d Estimated size or volume: how much data/material will you work with?
2.e Expected file format(s): in what format will your digital data be saved?
If you haven't begun filling out your DMP yet, you can find the DMP template here: UCPH DMP Template for Students
Please remember to discuss the data management plan with your supervisor at the start of your project. Keep the DMP stored along with your data.
____________________________________________________________
Practical tips and resources for assessing data types
- When embarking on a Bachelor’s or Master’s project, start by considering which data type you will work with. You may also want to read the section on data classification in lesson 6 “data storage and security”.
- If you plan to work with personal data, it is strongly recommended that you complete the online GDPR course for students before starting your project. The course takes about an hour to complete and it can be found here.
- If you plan to work with personal data, please view the guidelines and templates on the pages ‘How to collect and process personal data’ on your study program’s information pages on KUnet. You will find these pages under the section ‘Planning your studies’.
- Check this course’s RDM Glossary for definitions of terms used in this lesson and the other lessons.
____________________________________________________________
Learn more
Below, you will find external resources for optional further reading on some of the topics mentioned in this lesson.
Tips and tricks about file formats
Recommended file formats
UK Data Service: Recommended formats
File format database
Searchable database of different file formats and detailed information about them: FileInfo: The File Format Database
____________________________________________________________
Published in 2024