Methods for anonymising qualitative data

For questions, contact your privacy officer (PO). Contact details for your faculty’s PO are on the Privacy Office page on My EUR.

Things to think about before data collection

One of the best ways to protect the privacy of research participants is not to collect certain identifiable information at all. While planning your research, consider data minimisation: limit the collection of personal information to data that are directly relevant and necessary to the purposes of your study. For example, if your study allows it, you can ask participants before data collection to anonymise their experiences by avoiding mention of full personal names, exact dates, employment locations, or detailed information about third persons.

Planning anonymisation at an early stage of the research (for instance, in the data management plan) will help you to identify the resources needed in the different stages of the research life cycle.

In the absence of consent, the data you disclose must be anonymous. Planning anonymisation early in the research process also helps to reduce its costs. Anonymising qualitative data means balancing two priorities: protecting the identities of participants and maintaining the value and integrity of the data. Excessive removal of information from qualitative data such as text or audio/video recordings can distort the data, making them unusable, unreliable or misleading. To balance privacy protection with keeping data useful, anonymisation should be considered alongside informed consent and access controls.

Planning ahead and agreeing with participants during the consent process on what may and may not be recorded or transcribed can be a much more effective way of creating data that accurately represent the research process and the contribution of participants. For example, if an employer’s name cannot be disclosed, it should be agreed in advance that it will not be mentioned during an interview. This is easier than spending time later removing it from a recording or transcript.

Personal data contain information that directly or indirectly identifies a natural person (for definitions and examples see this link). Generally speaking, direct identifiers and strong indirect identifiers need to be removed or replaced with pseudonyms. Indirect identifiers can either be removed or categorised. In the case of qualitative data, categorising means coarsening identifying information, which is the better choice when the indirect identifier is essential for understanding the data. For example, instead of mentioning the exact age of a participant, use a category such as [20-25 years old]. This applies to indirect identifiers such as postal code, district or part of town, municipality of residence, region, municipality type, year of birth, age, household composition, occupation, education, mother tongue, nationality, workplace or employer, crime or punishment, and position of trust or membership, as well as all special-category information.
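The age categorisation mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name `age_band` and the default band width of five years are assumptions, not a prescribed standard. The bands here do not overlap (20-24, 25-29, and so on), so each age falls into exactly one category.

```python
def age_band(age: int, width: int = 5) -> str:
    """Coarsen an exact age into a bracketed category.

    Hypothetical helper: the band width is an assumption and should be
    chosen so that each category contains enough participants.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if width < 1:
        raise ValueError("width must be at least 1")
    lower = (age // width) * width   # start of the band the age falls in
    upper = lower + width - 1        # last age still inside that band
    return f"[{lower}-{upper} years old]"


print(age_band(23))            # [20-24 years old]
print(age_band(40, width=10))  # [40-49 years old]
```

Wider bands lower the disclosure risk but also remove more detail, so the width is a judgement call that depends on the study.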

Best practices for pseudonymisation/anonymisation of qualitative data

Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixelating sections of a video image, significantly reduces the usefulness of the data. These processes are also highly labour-intensive and expensive.

If confidentiality of audio-visual data is an issue, it is better to obtain the participant’s consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as a better strategy.

  • Plan anonymisation and experiment with a couple of files at the time of transcription or initial write-up. Longitudinal studies may be an exception if relationships between waves of interviews need special attention for harmonised editing. 
  • Use pseudonyms or generic descriptors to edit identifying information, rather than blanking-out that information. 
  • Use pseudonyms or replacements that are consistent throughout the research team and the project. For example, use the same pseudonyms in publications and follow-up research.
  • Identify replacements in text clearly, for example with [brackets] or using XML tags such as <seg>word to be anonymised</seg>.
  • Use 'search and replace' techniques carefully so that unintended changes are not made, and misspelled words are not missed.
  • Create a copy of the files to be anonymised and anonymise the copied files. This way, possible errors in anonymisation can still be fixed.
  • Back up the original unedited version of the files (but store them separately) for use within the research team and for preservation. For persons who have both the unedited version and the anonymised version, the data is pseudonymised.
  • Create a pseudonymisation key (also known as an anonymisation log) of all replacements, aggregations or removals made and store such a log securely and separately from the anonymised data files.
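Several of the practices above (bracketed replacements, consistent pseudonyms, careful search and replace, and a log of changes) can be combined in a short Python sketch. Everything named here is hypothetical: the function `pseudonymise`, the example names and the pseudonyms are illustrations only, and the sketch is not a substitute for reading the transcript.

```python
import re


def pseudonymise(text, replacements):
    """Replace identifiers with bracketed pseudonyms and log each change.

    `replacements` maps an identifier to its pseudonym (hypothetical data).
    Returns the edited text and a list of (original, replacement, count).
    """
    log = []
    # Handle the longest identifiers first, so that a full name is
    # replaced before the first name it contains.
    for original in sorted(replacements, key=len, reverse=True):
        replacement = f"[{replacements[original]}]"
        # The lookarounds restrict matches to whole words, so that
        # "Rotterdam" does not change part of "Rotterdamse".
        pattern = re.compile(
            r"(?<!\w)" + re.escape(original) + r"(?!\w)", re.IGNORECASE
        )
        # A function is used as the replacement so that backslashes in a
        # pseudonym are never interpreted as regex escapes.
        text, count = pattern.subn(lambda _m, r=replacement: r, text)
        log.append((original, replacement, count))
    return text, log


key = {
    "Anna de Vries": "Participant 1",
    "Anna": "Participant 1",
    "Rotterdam": "City A",
}
edited, log = pseudonymise(
    "Anna de Vries works in Rotterdam. Anna likes Rotterdamse humour.", key
)
print(edited)
# [Participant 1] works in [City A]. [Participant 1] likes Rotterdamse humour.
```

Run this on a copy of the file, and store the returned log securely and separately from the anonymised data. Misspelled variants (for example "Roterdam") are not caught by this kind of matching, so a manual read of the result remains necessary.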

  1. Find and highlight direct identifiers by reading the transcript. 
  2. Assess indirect identifiers: 
    • Can the identity of a participant be known from information in the data file? 
    • Can a third party be disclosed or harmed from information in the data file? 
  3. Assess the wider picture:
    • Which identifying information about an individual participant can be noted from all the data and documentation available to a user? Remove (or pseudonymise) direct identifiers.
    • Which indirect identifiers are essential for understanding the data? Redact or categorise the indirect identifiers.
  4. Re-assess any remaining disclosure risk.

Further reading

The UK Data Service has developed a Text anonymisation helper tool, with installation instructions. It is an add-on macro for MS Word that aids the anonymisation of qualitative data. The tool does not anonymise or change the data; it finds and highlights numbers and words starting with a capital letter. Numbers and capitalised words are often disclosive: they can be names, companies, birth dates, addresses, educational institutions or countries.
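The idea behind such a helper (flagging numbers and capitalised words for manual review) can be illustrated with a small Python sketch. This is not the UK Data Service macro and does not reproduce its behaviour; the function name `flag_candidates` and the `>>` and `<<` markers are assumptions chosen for the example.

```python
import re

# A token is a run of word characters, optionally joined by hyphens or
# apostrophes (for example "Jean-Paul" or "O'Neill").
TOKEN = re.compile(r"\w+(?:[-'’]\w+)*")


def flag_candidates(text, open_mark=">>", close_mark="<<"):
    """Mark tokens that start with a capital letter or contain a digit.

    The text itself is not anonymised; tokens are only flagged so that a
    person can review them.
    """
    def mark(match):
        token = match.group(0)
        if token[0].isupper() or any(ch.isdigit() for ch in token):
            return f"{open_mark}{token}{close_mark}"
        return token

    return TOKEN.sub(mark, text)


print(flag_candidates("I met Sanne in Utrecht on 12 May 2019."))
# >>I<< met >>Sanne<< in >>Utrecht<< on >>12<< >>May<< >>2019<<.
```

Words at the start of a sentence are flagged as well, so the output contains false positives and always needs a human check. Identifiers written in lower case are not flagged at all.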

CESSDA has a detailed example/exercise of anonymising a transcript at the bottom of this page.

On the page of the Finnish Social Science Data Archive you can find practical tips and a detailed guide of techniques for anonymisation of qualitative data (which can also be used in case anonymisation can only be done to a degree).

The UK Data Service has a page on best practices for transcribing audio-visual data. If you use, or are considering using, external transcribers or automatic speech recognition (ASR) software for an initial transcription, contact your privacy officer beforehand to discuss whether agreements need to be signed, and which ones.

The open-source text anonymisation software Textwash allows researchers who know Python basics to automatically detect and replace potential identifiers in English-language text. More information can be found in this paper by Kleinberg and colleagues (2022) and on the project’s GitHub page. Building on Textwash, the tool FAMTAFOS will feature an easy-to-use desktop app that allows users to anonymise English and Dutch texts at scale.

Advice on this page is compiled based on the information provided by the UK Data Service, CESSDA, the Finnish Social Science Data Archive and FORS.

This page was last updated in June 2024. Did you find a broken link or (seemingly) incorrect information? Please send an email with the title 'Website content' to datasteward@eur.nl.
