Metadata Standards
To produce consistent documentation, various research communities have established metadata standards. These specify in detail the information that should be included during documentation to describe a particular data object, dataset, or research project, according to the needs of a specific user or scientific community.
Most metadata standards will contain fields dedicated to the description of the data, the technical requirements needed to reuse it, information on licensing and intellectual property rights, and user access. Below is a summary of the most frequent types of metadata that can be used to describe a research project:
| Study-level metadata | Goal | Example |
| --- | --- | --- |
| Descriptive metadata | Metadata required to find a digital object and assess usability | Author, title, abstract, date, location, time, data collection methods and tools |
| Structural metadata | Relate the individual objects of a study | Links to related digital objects (e.g., the article linked to the research data) |
| Technical metadata | Give information on the technical aspects of the dataset | Data format, hardware/software used, calibration, version, authentication, encryption, metadata standard |
| Administrative metadata | Focus on user rights and management of digital objects | License, reasons for embargo, waivers, search logs, user tracking |
(Source: The ins and outs of metadata and data documentation)
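As a rough illustration, the four metadata types in the table could be captured in one machine-readable record stored next to the data. The Python sketch below is only an example: every field name and value is a hypothetical placeholder and does not follow any particular metadata standard.

```python
import json

# Hypothetical study-level metadata record, grouped by the four types in the table.
# All field names and values are illustrative assumptions, not a formal standard.
record = {
    "descriptive": {
        "title": "Example survey on commuting habits",
        "author": "A. Researcher",
        "date": "2022-05-21",
        "collection_method": "online questionnaire",
    },
    "structural": {
        "related_objects": ["article.pdf", "codebook.txt"],
    },
    "technical": {
        "data_format": "CSV",
        "software": "R",
        "version": "v1",
    },
    "administrative": {
        "license": "CC BY 4.0",
        "embargo_reason": None,
    },
}


def to_json(metadata: dict) -> str:
    """Serialize the record as indented JSON so it can be saved alongside the data."""
    return json.dumps(metadata, indent=2, sort_keys=True)


print(to_json(record))
```

In practice, the fields would be taken from the metadata standard used in your discipline rather than invented per project.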
Controlled Vocabularies
To make metadata creation more efficient, one can use controlled vocabularies to populate the metadata elements. Controlled vocabularies are organized and standardized words and phrases that provide a consistent way of describing, cataloging, or indexing data. They include subject headings, alphabetical lists, taxonomies, thesauri, and ontologies. Using a controlled vocabulary will increase the findability and shareability of your data among researchers in the same discipline. Examples of controlled vocabularies used in various communities are:
- Arts & Humanities: Art & Architecture Thesaurus, Thesaurus of Musical Instruments
- Health Sciences & Medicine: International Classification of Diseases (ICD), Medical Subject Headings (MeSH)
- Social Sciences: Ethnographic Thesaurus, Thesaurus for Economics
File naming
Deciding on naming conventions for the files and folders that will contain your data allows you and anyone on your team to easily navigate the content, status, and version of your files. An initial consideration when deciding on naming conventions is to make them both machine-readable and human-readable. The following tips will be helpful:
- Avoid spaces, punctuation, and special characters such as ? \ ! @ * % { [ < >, and do not rely on upper and lower case to tell files apart.
- Use delimiters deliberately. Use a hyphen (-) to mean “different words that are part of the same chunk”, and an underscore (_) to separate different chunks of metadata
- Choose keywords and file names that are sufficiently descriptive, e.g., analysis01_descriptive-statistics.R, analysis02_preregistered-analysis.R
- Use YYYY-MM-DD date format (ISO 8601 standard)
- To order files, put the date or a number first, e.g., 2019-01-01_original-analysis.R, 2019-12-01_minor-changes-to-original.R, 01_original-analysis.R, 02_minor-changes-to-original.R
- Include the version of the file, e.g., methodology-section_v1
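The tips above can be sketched as a small Python helper that builds a file name from a date, a description, and a version, and checks whether an existing name follows the same pattern. The function names and the exact pattern are assumptions made for this illustration; adapt them to the convention your team agrees on.

```python
import re
from datetime import date

# Hypothetical rule for this sketch: chunks of lowercase letters and digits,
# words within a chunk joined by hyphens, chunks separated by underscores,
# followed by a file extension.
CHUNK = r"[a-z0-9]+(?:-[a-z0-9]+)*"
VALID_NAME = re.compile(rf"{CHUNK}(?:_{CHUNK})*\.[A-Za-z0-9]+")


def slug(text: str) -> str:
    """Lowercase a description and join its words with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def build_name(day: date, description: str, version: int, extension: str) -> str:
    """Combine an ISO 8601 date, a description, and a version into one file name."""
    chunks = [day.isoformat(), slug(description), f"v{version}"]
    return "_".join(chunks) + "." + extension


def is_valid(name: str) -> bool:
    """Return True if the whole name matches the pattern above."""
    return VALID_NAME.fullmatch(name) is not None


print(build_name(date(2019, 1, 1), "Original analysis", 1, "R"))
# prints 2019-01-01_original-analysis_v1.R
print(is_valid("analysis01_descriptive-statistics.R"))  # True
print(is_valid("methodology section (final).docx"))     # False: spaces and brackets
```

Putting the date first means that sorting the names alphabetically also sorts the files chronologically.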
For added coherence and consistency, you can describe the naming conventions used in a separate README file. You can apply naming conventions to the files and folders of your data even if you have already created most of them, by using a bulk-file renaming utility. This is a type of software that allows you to apply the same naming elements to multiple files until you reach a consistent naming structure for all your files and folders.
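If no dedicated utility is at hand, a bulk rename can also be approximated with a short script. The Python sketch below is a minimal example under assumed rules (spaces become hyphens, other special characters are dropped). It first produces a plan of old and new names so the result can be reviewed before any file is touched. The function names are hypothetical.

```python
import re
from pathlib import Path


def normalized(name: str) -> str:
    """Assumed rule: spaces become hyphens, other special characters are dropped."""
    path = Path(name)
    stem = re.sub(r"\s+", "-", path.stem.strip())
    stem = re.sub(r"[^A-Za-z0-9_-]", "", stem)
    return stem + path.suffix


def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """Return (old, new) pairs without changing anything on disk (a dry run)."""
    plan = []
    for old in sorted(folder.iterdir()):
        # Skip folders and hidden files such as .gitignore.
        if old.is_file() and not old.name.startswith("."):
            new = old.with_name(normalized(old.name))
            if new != old:
                plan.append((old, new))
    return plan


def apply_renames(plan: list[tuple[Path, Path]]) -> None:
    """Rename the files in the plan, refusing to overwrite a different existing file."""
    for old, new in plan:
        if new.exists() and not new.samefile(old):
            raise FileExistsError(f"{new} already exists; not overwriting")
        old.rename(new)
```

Under these rules, a file called "My Data (final).csv" would become "My-Data-final.csv". It is advisable to run such a script on a copy of the folder first and to review the plan before applying it.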
Versioning
Version control is the process of recording and managing different drafts and versions of a document or a dataset. It provides a record of the updates and revisions that led to the final version. Versioning is advised where more than one version of a document or a dataset exists (or is likely to exist in the future). Depending on the data you are working with, it can be done by:
- Recording the date in the file name, e.g., 2022-05-21_Health-test
- Adding sequential numbers at the end of the file name, e.g., _v1, _v2, _v3
- Creating a version control table that lists each version number, its date, and the changes made along with their purpose
- Adding a “version control” tab in the spreadsheet with the version, date and changes columns
- Using version control software (e.g., Git, together with a hosting platform such as GitHub or GitLab)
- Using tools that automatically keep versions of your work (e.g., Overleaf)
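As an illustration of the version control table mentioned in the list, the Python sketch below keeps the table as a CSV file with version, date, and changes columns. The file layout and function names are assumptions for this example. A spreadsheet tab with the same three columns serves the same purpose.

```python
import csv
from datetime import date
from pathlib import Path

# Column names for the version control table (an assumed layout).
FIELDS = ["version", "date", "changes"]


def log_version(table: Path, version: str, changes: str, day: date) -> None:
    """Append one row to the table, writing the header row on first use."""
    is_new = not table.exists()
    with table.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow(
            {"version": version, "date": day.isoformat(), "changes": changes}
        )


def read_versions(table: Path) -> list[dict]:
    """Read the table back as a list of rows, oldest first."""
    with table.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

A call such as `log_version(Path("version-log.csv"), "v2", "Corrected coding of age variable", date(2022, 5, 21))` would add one row. The file name and the change description here are made up for the example.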
While working on a project, it is useful to decide how many versions of a file to keep, which versions to keep, for how long, and how to organize them. You could, for instance, identify milestone versions and decide to keep major revisions rather than minor ones. It is helpful to stick to one naming convention, for example dates or version numbers, and to agree on a single location for storing master versions.
README file
A readme file provides information about a project or a dataset. It helps to ensure that the data can be correctly interpreted by yourself (at a later date) or by others (when sharing or publishing data). In most cases, a readme file must be submitted along with the dataset file(s). The main considerations are:
- Create one readme file for each dataset.
- Name the file README (not readme, read_me, ABOUT, etc.).
- Write it as a plain text file and save it as README.txt (or README.md when writing in Markdown).
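To get started, a plain-text skeleton can be generated with a few lines of Python. The sketch below is only a starting point: the section headings are hypothetical suggestions, not a required template, so adjust them to the requirements of the repository you submit to.

```python
from pathlib import Path

# Hypothetical section headings; these are assumptions, not a prescribed template.
SECTIONS = [
    "Author and contact",
    "Description of the dataset",
    "Files and folders",
    "File naming conventions",
    "Version history",
    "License",
]


def readme_text(title: str) -> str:
    """Build a plain-text skeleton with one empty section per heading."""
    lines = [title, "=" * len(title), ""]
    for section in SECTIONS:
        lines += [section, "-" * len(section), "(to be completed)", ""]
    return "\n".join(lines)


def write_readme(folder: Path, title: str) -> Path:
    """Save the skeleton as README.txt in the dataset folder, without overwriting."""
    target = folder / "README.txt"
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting")
    target.write_text(readme_text(title), encoding="utf-8")
    return target
```

The script refuses to overwrite an existing README.txt, so a file that has already been filled in is not lost.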
The advice on the Data Documentation pages is compiled from information provided by RDNL, the UK Data Service, CESSDA, the Finnish Social Science Data Archive, Utrecht University, DCC and 4TU.ResearchData.
This page was last updated in January 2023. Did you find a broken link or (seemingly) incorrect information? Please send an email with the title 'Website content' to datasteward@eur.nl.