Data Linkage Safeguards


There are a number of data linkage safeguards that can be used during a project to ensure that privacy is considered at every step of the process. For example:

Statistical disclosure control

This forms part of the data linkage quality assurance process and ensures that the results of the research (often referred to as outputs) do not include information that could be used to identify an individual. Before the researcher can remove their results from the secure environment, the results are reviewed by a ‘Research Coordinator’ to ensure that they contain no identifiable information.

For example, a data linkage project is established to identify whether people living in certain areas are more likely to contract a specific disease. This will allow for the targeted roll-out of a disease prevention programme across Scotland. The results of the data linkage project show that only one person in a remote rural location suffers from the disease. As this information could be used to identify the individual, it is removed from the results ahead of release from the safe haven and subsequent publication.

There are a number of methods for applying statistical disclosure control, details of which can be found here: http://www.scotland.gov.uk/Topics/Statistics/About/Methodology/DiscCont
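
One common disclosure control technique, suppressing small counts, can be sketched in Python as follows. This is an illustration only: the threshold of 5, the function name and the figures are assumptions made for the example and are not taken from the guidance linked above. In practice, disclosure control usually involves further checks, such as making sure a suppressed value cannot be recovered from published totals.

```python
# Hypothetical threshold: counts below this are withheld from release.
SUPPRESSION_THRESHOLD = 5


def suppress_small_counts(counts_by_area, threshold=SUPPRESSION_THRESHOLD):
    """Return a copy of the table with small non-zero counts replaced by None."""
    safe = {}
    for area, count in counts_by_area.items():
        # A count between 1 and threshold - 1 could single out individuals.
        if 0 < count < threshold:
            safe[area] = None  # withheld from the released output
        else:
            safe[area] = count
    return safe


# Made-up figures for illustration only.
cases = {"Urban area A": 120, "Town B": 37, "Remote rural area C": 1}
released = suppress_small_counts(cases)
```

Here the single case in the remote rural area is withheld, while the larger counts are released unchanged.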

Proportionate risk management

Data users and data controllers must assess the level of risk associated with a data linkage project and manage it responsibly. Matching the safeguards to the level of risk keeps the work efficient, reducing the time and money spent on the project, while still ensuring that privacy is considered at an individual level. This is termed ‘proportionate risk management’.

It is not in the public interest to undertake unnecessary anonymisation work where the risk is low, as it increases public spending with little or no benefit. Unnecessary work is also time-consuming and can substantially delay results that are in the public interest and should be reported promptly to support, for example, better service provision across Scotland. Where a risk is identified, it must be addressed adequately.

Separation of functions

‘Separation of functions’ is best explained by working through an example. The following illustration shows how and why keeping data controllers, indexers and linkers separate helps protect privacy while still achieving data linkage.

A researcher wants to study the relationship between health and employment. Their specific research question requires them to analyse four variables: occupation, income, blood count and medication. In order to do the research it is necessary to link data from two different organisations.

In the first organisation, the ‘Health Data Controller’ controls a dataset which contains information about blood count and medication. In the second organisation, the ‘Employment Data Controller’ controls a dataset which contains information about employment and income.

The ‘Health Data Controller’ takes a copy of their dataset, attaches a number made for this purpose, called an Indexing Number, to each record, and sends this along with the name, address and date of birth of the people in their dataset to an ‘Indexer’. The ‘Health Data Controller’ does not send any information about blood count or medication to the ‘Indexer’.

The ‘Employment Data Controller’ takes a copy of their dataset, also creates and attaches an Indexing Number to each record, and sends that along with the name, address and date of birth of the people in their dataset to the same ‘Indexer’. The ‘Employment Data Controller’ does not send any information about employment or income to the ‘Indexer’.

Note that the ‘Employment Data Controller’ makes up the Indexing Number for their data completely independently of the Health Data Controller – they are different indexing numbers.

The ‘Indexer’ links the two datasets it has received, based on the names, addresses and dates of birth they contain. For each person matched, the ‘Indexer’ creates two ‘Linking Numbers’, one for the health data and one for the employment data, and keeps a look-up table of those linking numbers.

The ‘Indexer’ sends the health Indexing Number back to the ‘Health Data Controller’ with the Health Linking Number attached.

The ‘Indexer’ also sends the employment Indexing Number back to the ‘Employment Data Controller’ with the employment linking number attached.

The ‘Indexer’ then sends the linking number look-up table to the ‘Linker’, and safely destroys the two copies of the names, addresses and dates of birth they originally received. The ‘Indexer’ is left with nothing.
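
The Indexer's part of the process described so far can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method any real indexing service uses: it assumes exact matching on name, address and date of birth (real records often contain misspellings or out-of-date addresses), and the function name `index_records`, the data layout, the sample people and the use of random hexadecimal strings as linking numbers are all invented for the example.

```python
import secrets


def index_records(health_ids, employment_ids):
    """Match two identifier files and issue linking numbers for each side.

    Each input maps an Indexing Number to a (name, address, date_of_birth)
    tuple. No health or employment variables are passed in.
    """
    # Assumption: people are matched on exactly identical identifiers.
    employment_by_person = {person: idx for idx, person in employment_ids.items()}

    health_return = {}      # health Indexing Number -> health Linking Number
    employment_return = {}  # employment Indexing Number -> employment Linking Number
    lookup_table = []       # (health Linking Number, employment Linking Number)

    for health_idx, person in health_ids.items():
        employment_idx = employment_by_person.get(person)
        if employment_idx is None:
            continue  # this person does not appear in the employment file
        health_link = secrets.token_hex(8)
        employment_link = secrets.token_hex(8)
        health_return[health_idx] = health_link
        employment_return[employment_idx] = employment_link
        lookup_table.append((health_link, employment_link))

    return health_return, employment_return, lookup_table


# Made-up identifiers for illustration only.
health_ids = {
    "H1": ("Ann Example", "1 High Street", "1980-01-01"),
    "H2": ("Bob Sample", "2 Low Road", "1975-06-30"),
}
employment_ids = {"E7": ("Ann Example", "1 High Street", "1980-01-01")}

health_return, employment_return, lookup_table = index_records(
    health_ids, employment_ids
)
```

In this sketch, `health_return` goes back to the ‘Health Data Controller’, `employment_return` goes back to the ‘Employment Data Controller’, and only `lookup_table` is passed on to the ‘Linker’. The look-up table contains linking numbers and nothing else.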

The ‘Health Data Controller’ now attaches the health linking number they have been sent to a copy of their dataset, using the indexing number they made up to work out which linking number matches which record. The ‘Health Data Controller’ then sends blood count, medication and linking number to the ‘Linker’. The ‘Health Data Controller’ does not provide the ‘Linker’ with the names, addresses and dates of birth of the people in the dataset, or with the indexing number. The ‘Health Data Controller’ then destroys their copy of the linking number.

The ‘Employment Data Controller’ does the same thing: they attach the employment linking number they have been sent to a copy of their dataset, using the indexing number they made up to work out which linking number matches which record. The ‘Employment Data Controller’ then sends employment, income and linking number to the ‘Linker’. They do not provide the ‘Linker’ with the names, addresses and dates of birth of the people in the dataset, or with the indexing number. The ‘Employment Data Controller’ then destroys their copy of the linking number.

The ‘Linker’ can now join blood count, medication, employment and income together using the linking number look-up table. The ‘Linker’ then deletes the linking numbers and the look-up table and adds a new, unique and meaningless ‘Project Number’ to each record. They deposit the resulting dataset in a safe haven for the researcher to access.
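
The Linker's step can be sketched in the same way. Again, this is an illustration under stated assumptions rather than a description of any real linkage service: the function name `link_datasets`, the data layout, the sample linking numbers and values, and the use of a random hexadecimal string as the Project Number are all invented for the example.

```python
import secrets


def link_datasets(health_rows, employment_rows, lookup_table):
    """Join the two sets of variables, then drop the linking numbers.

    health_rows maps a health Linking Number to a dict of health variables,
    employment_rows maps an employment Linking Number to a dict of
    employment variables, and lookup_table is a list of
    (health Linking Number, employment Linking Number) pairs.
    """
    linked = []
    for health_link, employment_link in lookup_table:
        if health_link in health_rows and employment_link in employment_rows:
            # The new Project Number is random and carries no meaning.
            record = {"project_number": secrets.token_hex(8)}
            record.update(health_rows[health_link])
            record.update(employment_rows[employment_link])
            # The linking numbers are deliberately not copied into the record.
            linked.append(record)
    return linked


# Made-up values for illustration only.
health_rows = {"H-a1": {"blood_count": 4.8, "medication": "none"}}
employment_rows = {"E-z9": {"employment": "teacher", "income": 32000}}
lookup_table = [("H-a1", "E-z9")]

research_dataset = link_datasets(health_rows, employment_rows, lookup_table)
```

Each record in `research_dataset` holds the four research variables and a Project Number, with no names, addresses, dates of birth, indexing numbers or linking numbers.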


At the end of this process:

The ‘Health Data Controller’ has learned nothing new about the people in their dataset. They haven’t seen any information about their employment or income.

The ‘Employment Data Controller’ has learned nothing new about the people in their dataset. They haven’t seen any information about their blood count or medication.

The ‘Indexer’ saw a lot of names, addresses and dates of birth, but did not see any information about those people’s income, employment, blood count or medication.

The ‘Linker’ and the researcher see data about income, employment, blood count and medication but no names, addresses or dates of birth.