GDPR: Use pseudonymisation and anonymisation to create GDPR compliance

By Henning Mortensen
December 4, 2019

Pseudonymisation and anonymisation are important techniques for improving the protection of data subjects. Anonymisation has long been a focus of data protection law, whereas pseudonymisation is a newer concept. In this article, we take a look at how pseudonymisation and anonymisation can play a role in creating GDPR compliance and protecting data subjects' rights.

What is anonymisation in the General Data Protection Regulation (GDPR)?

Anonymisation is referred to in recital 26 of the General Data Protection Regulation as

‍"information that does not relate to an identified or identifiable natural person, or personal data that has been made anonymous in such a way that the data subject cannot – or can no longer be – identified".

That means that no one should be able to recognise the persons based on the information, either on its own or by combining it with other information. It is a condition that the anonymisation is irreversible, in the sense that there is no way for anyone to link the information to a natural person again. When data is anonymised, it is no longer personal data and thus falls completely outside the data protection rules. This has been the applicable practice since the Data Protection Directive 95/46/EC.

What is pseudonymisation in the General Data Protection Regulation (GDPR)?

Pseudonymisation is defined in Article 4 of the General Data Protection Regulation as the

‍"processing of personal data in such a way that the personal data can no longer be attributed to a specific data subject without the use of supplementary information, provided that such supplementary information is stored separately and is subject to technical and organisational measures to ensure that the personal data is not attributed to an identified or identifiable physical person”.

Pseudonymisation is a newer concept that was not mentioned in the Data Protection Directive, but has found its way into the General Data Protection Regulation, where it is mentioned in 15 places – especially as a security measure or as a data protection by design measure in the solutions that are used.

There is no doubt that anonymisation and pseudonymisation have enormous potential to increase the protection of data subjects – but only if they are done correctly. In relation to the General Data Protection Regulation, the central question is whether unauthorised persons can reverse the anonymisation or pseudonymisation and in that way re-identify the data subjects. If this happens, it is a breach of security within the meaning of the Regulation, and the data controller can potentially be fined for not having implemented its measures correctly.

Failed anonymisation: two famous cases

Case 1

In 2006, AOL published 20 million searches made by 650,000 users over a three-month period, with the aim of making the search data available for research.

AOL had anonymised the data beforehand by removing IP addresses and replacing each username with a unique numeric code. Based on the searches themselves, two journalists at The New York Times succeeded, relatively quickly, in identifying an elderly woman as one of the supposedly anonymous users. In her searches she had used personal names and geographical information, and had signalled that she had a dog and was interested in 60-year-old men. As the journalists wrote: "Her searches are a catalogue of intentions, curiosity, anxieties and quotidian questions".

Case 2

In 2006, Netflix published an anonymised data set with over 100 million non-public film ratings from 480,000 users, where the users' names were replaced by a number. At the same time, Netflix promised a prize of $1 million to whoever could help improve Netflix's film recommendation algorithm based on this data. Two researchers analysed the data and paired it with a small sample of public data from the film database IMDb, where users also rate films and where the ratings are public. By correlating the two data sets, the researchers were able to identify 84% of the Netflix users. On the basis of the non-public ratings, they could also assess whether users had an interest in certain political, religious or sexually oriented films and, on that basis, say something (with a certain probability) about the users' political views, religious beliefs and sexual preferences.

Three risks to successful anonymisation

When you anonymise data, you can assess whether the anonymisation can be attacked by trying to uncover the threats that could be directed at the anonymised data set. Here, a distinction is made between three types of risk:

Singling out: isolating some records in a data set in such a way that an individual is identified.

Linkage/linkability: creating a link between two records concerning the same data subject.

Derivation/inference: deducing, with significant probability, the value of an attribute from the values of other attributes and on that basis identifying the data subject.

The Article 29 Working Party has assessed the common anonymisation techniques against these three types of risk.

Anonymisation techniques: Randomisation & generalisation

Generally, there are two ways to anonymise: randomisation and generalisation.

Randomisation means that you change the accuracy of the data so that it is no longer possible to create a connection between the data and the person. There are different ways to randomise:

Randomisation

Noise addition:

Here you add noise to the observations. If the noise is random, you distort the individual data points but approximately preserve the average of the observations. If you, for example, have a data set with ten height measurements and add random noise of +/- 10 cm to each measurement, the average stays roughly the same, but the measurement that was previously the highest is no longer necessarily the highest – the mutual ranking has changed.
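A minimal Python sketch of noise addition, assuming a hypothetical set of ten height measurements and a +/- 10 cm noise band (both are illustration-only assumptions):

```python
import random

# Hypothetical height measurements in centimetres.
heights = [158, 163, 170, 171, 175, 178, 180, 182, 185, 191]

# Add uniform random noise of +/- 10 cm to each measurement.
noisy = [h + random.uniform(-10, 10) for h in heights]

# The average is approximately preserved ...
print("original mean:", sum(heights) / len(heights))
print("noisy mean:   ", sum(noisy) / len(noisy))

# ... but the mutual ranking can change: the tallest noisy record is
# not necessarily the tallest person in the original data.
print("tallest before:", heights.index(max(heights)))
print("tallest after: ", noisy.index(max(noisy)))
```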

Permutation:

Here, the observations in a data set are swapped, so that some values become associated with a different individual than originally. The advantage is that the values themselves are not changed.
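A minimal sketch of permutation, assuming a hypothetical toy data set where the salary attribute is shuffled across records:

```python
import random

# Hypothetical records pairing a person with a salary attribute.
records = [
    {"person": "A", "salary": 41000},
    {"person": "B", "salary": 52000},
    {"person": "C", "salary": 63000},
    {"person": "D", "salary": 74000},
]

# Shuffle the salary values across the records: every value is preserved
# exactly, but it is no longer attached to the original individual.
salaries = [r["salary"] for r in records]
random.shuffle(salaries)
permuted = [{"person": r["person"], "salary": s} for r, s in zip(records, salaries)]

print(permuted)
```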

Differential privacy:

Here, noise is also added, but only when an analyst asks a question of the data. The noise is added in such a way that the response is representative, but the analyst cannot know whether any individual value is actually correct. The point is that the analyst's result must be roughly the same regardless of whether a particular person is in the database or not.
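One common way to realise this idea is the Laplace mechanism. The sketch below is an illustration only: the age data, the query and the epsilon value are all hypothetical assumptions.

```python
import random

# Hypothetical ages held by the data controller.
ages = [19, 23, 31, 34, 40, 41, 47, 52, 58, 63]

def noisy_count(data, predicate, epsilon=0.5):
    """Answer "how many records satisfy predicate?" with Laplace noise.

    Adding or removing one person changes the true count by at most 1
    (sensitivity 1), so noise with scale 1/epsilon makes the answer
    roughly the same whether or not any single person is in the data.
    """
    true_count = sum(1 for x in data if predicate(x))
    # A Laplace(0, 1/epsilon) sample as the difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# The analyst asks a question and receives a representative but noisy answer.
print(noisy_count(ages, lambda a: a >= 40))
```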

Generalisation: aggregation and k-anonymity

Generalisation means that you reduce the granularity of the values associated with the data subjects, lowering the level of detail. Instead of recording that a data subject is 47 years old, you can say that the data subject is in the 40-50 age group. In this way, several data subjects typically share the same value, and it becomes less likely that any individual data subject can be singled out. You can generalise on, for example, geography, age, salary, time, weight, height or doses.
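A minimal sketch of generalisation by binning exact ages into ten-year groups (the ages are hypothetical):

```python
# Replace an exact age with a ten-year age group.
def age_group(age, width=10):
    low = (age // width) * width
    return f"{low}-{low + width}"

ages = [23, 27, 31, 38, 40, 47, 47, 52]
print([age_group(a) for a in ages])
# ['20-30', '20-30', '30-40', '30-40', '40-50', '40-50', '40-50', '50-60']
```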

Aggregation and k-anonymity:

Here you continue to generalise into classes until it is no longer possible to single out a data subject within a group of at least k individuals. No outsider must be able to derive other attributes, not even with knowledge that a particular data subject is part of the data set and knowledge of one of that person's attributes. The information about any data subject in the data set must be indistinguishable from that of at least k-1 other persons in the data set.

An example:

Suppose an outsider knows that a data subject is in a data set and that this person is 20 years old. If only one record in the data set has the age 20, the outsider can establish that data subject's diagnosis with certainty.

Another example:

Suppose instead that we have 2-anonymity for the age, gender and city attributes, so that any combination of these attributes can be found in at least two rows in the data set. Even so, we may still be able to state that a certain 19-year-old man we know is in the data set has one of, for example, three possible diagnoses.
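As a sketch of how this property could be checked, the function below computes the k of a table over a chosen set of quasi-identifiers. The toy table is a hypothetical illustration, not the data set from the examples above.

```python
from collections import Counter

# Hypothetical, already-generalised table.
table = [
    {"age": "10-20", "gender": "M", "city": "Aarhus", "diagnosis": "flu"},
    {"age": "10-20", "gender": "M", "city": "Aarhus", "diagnosis": "asthma"},
    {"age": "20-30", "gender": "F", "city": "Odense", "diagnosis": "flu"},
    {"age": "20-30", "gender": "F", "city": "Odense", "diagnosis": "diabetes"},
]

QUASI_IDENTIFIERS = ("age", "gender", "city")

def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest equivalence class over the quasi-identifiers."""
    classes = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(classes.values())

print(k_anonymity(table, QUASI_IDENTIFIERS))  # 2 -> the table is 2-anonymous
```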

L-diversity and T-closeness:

L-diversity extends k-anonymity by requiring that, within each class, the sensitive attribute takes at least L different values. Even with L-diversity, an attacker can infer a data subject's attribute with high probability if the sensitive values within a class are unevenly distributed.

T-closeness seeks to eliminate this by requiring that the distribution of the sensitive attribute within each class stays close to its distribution in the data set as a whole. The advantage of L-diversity and T-closeness is that an attacker cannot be completely sure that a data subject has a particular attribute.
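In the same spirit, a minimal sketch of measuring L-diversity – the smallest number of distinct sensitive values found in any equivalence class – again on a hypothetical toy table:

```python
from collections import defaultdict

# Hypothetical table: quasi-identifiers plus a sensitive diagnosis attribute.
table = [
    {"age": "10-20", "gender": "M", "diagnosis": "flu"},
    {"age": "10-20", "gender": "M", "diagnosis": "asthma"},
    {"age": "20-30", "gender": "F", "diagnosis": "flu"},
    {"age": "20-30", "gender": "F", "diagnosis": "flu"},
]

def l_diversity(rows, quasi_identifiers, sensitive):
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    # The table is L-diverse for the smallest number of distinct
    # sensitive values found in any class.
    return min(len(values) for values in groups.values())

print(l_diversity(table, ("age", "gender"), "diagnosis"))  # 1 -> not 2-diverse
```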

Pseudonymisation techniques

Pseudonymisation consists in replacing something immediately identifying, such as a Social Security number, with a different value in the data set. The relationship between this value and the Social Security number is then stored in a second data set, separate from the first. Pseudonymised information is still personal data, because someone – namely whoever is in possession of the second data set – can re-establish the connection. Security is nevertheless enhanced, because an attacker cannot immediately establish a connection between the data subject and the data in the first data set.
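A minimal sketch of this idea, assuming a hypothetical national ID number as the identifier; the token-to-identity mapping plays the role of the "supplementary information" that must be stored separately:

```python
import secrets

working_data = []   # what case workers see
key_store = {}      # the mapping, stored separately under stricter access control

def pseudonymise(national_id, payload):
    """Replace the identifier with a random token and record the mapping separately."""
    token = secrets.token_hex(8)
    key_store[token] = national_id
    working_data.append({"id": token, **payload})
    return token

# Hypothetical identifier and case data.
token = pseudonymise("010190-1234", {"diagnosis": "asthma"})
print(working_data)        # no national ID visible to ordinary users
print(key_store[token])    # re-identification requires access to the key store
```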

Pseudonymisation techniques include, among others:

Encryption with a secret key: identifying data (for instance a Social Security number) is encrypted with a secret key, and only whoever is in possession of the key can re-establish the connection.

Hash functions: identifying data (for instance a Social Security number) in a database is hashed. However, if an attacker hashes all possible Social Security numbers and compares the results with the hash values in the database, the identifying data can be recovered (see the sketch after this list).

Various other cryptographic techniques, for instance keyed-hash functions, where the hash is computed with a secret key.
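The weakness of plain hashing and the keyed alternative can be sketched as follows; the ID value is hypothetical, and the secret key would itself have to be stored separately and protected:

```python
import hashlib
import hmac
import secrets

national_id = "010190-1234"  # hypothetical identifier

# Unkeyed hash: an attacker who hashes every possible ID and compares the
# results with the database can recover the identifier.
plain_pseudonym = hashlib.sha256(national_id.encode()).hexdigest()

# Keyed hash (HMAC): without the secret key, that dictionary attack fails.
secret_key = secrets.token_bytes(32)
keyed_pseudonym = hmac.new(secret_key, national_id.encode(), hashlib.sha256).hexdigest()

print(plain_pseudonym)
print(keyed_pseudonym)
```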

Conclusion

In practice, it is difficult to anonymise personal data completely, and many data controllers have had to learn the hard way that their attempts have failed. However, pseudonymisation and anonymisation have great potential, because they considerably reduce the data subjects' risks during processing. In practice, a great deal of case processing can be done on pseudonymous data, so that only a very limited group of actors can actually see whose data they are processing. Similarly, one could imagine that the key to re-establishing the connection between identity and pseudonym is left with the data subject, giving the data subject maximum control over their personal data.

Discover how Wired Relations can ensure overview, structure and control in data protection.
Try it for free.