The CHeReL's probabilistic linkage procedures are designed to achieve a false positive rate of around 5/1,000. This means that in a dataset of 100,000 persons (that is, 100,000 Project Person Numbers, or PPNs) the records of around 500 PPNs are expected to contain linkage errors. The CHeReL also aims to achieve a false negative rate of around 5/1,000, although missing and incomplete identifiers can contribute to a higher rate of missed links. For each project, the rate of linkage errors will be reported to the Chief Investigator. The estimated false positive rate for the current version of the Master Linkage Key (MLK) is 5 per 1,000 (0.5%).
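The arithmetic behind the figure of 500 PPNs can be checked with a short sketch. The function name and its parameters are illustrative only and are not part of any CHeReL tooling.

```python
def expected_ppns_with_errors(n_ppns: int, errors_per_1000: float = 5.0) -> float:
    """Expected number of PPNs whose records contain linkage errors.

    Assumes the error rate applies uniformly across all PPNs in the dataset.
    """
    return n_ppns * errors_per_1000 / 1000


# A dataset of 100,000 PPNs at a rate of 5 per 1,000:
print(expected_ppns_with_errors(100_000))  # 500.0
```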
Probabilistic record linkage software works by assigning a 'linkage weight' to pairs of records. For example, records that match perfectly or nearly perfectly on first name, surname, date of birth and address have a high linkage weight, and records that match only on date of birth have a low linkage weight. If the linkage weight is high, it is likely that the records truly match; if the linkage weight is low, it is likely that they do not. This is shown in Figure 1.
Figure 1: Linkage weights in probabilistic record linkage
There are pairs of matched records where the linkage weights are neither high nor low, but somewhere in the middle. So how do we decide if they are true matches or not?
We could choose the middle linkage weight as a single cut-off and arbitrarily say that all pairs of records with linkage weights above the cut-off are 'true' matches, and all pairs of records with linkage weights below the cut-off are 'false' matches. Unfortunately, this would result in some false matches with linkage weights above the cut-off being included with the true matches, and some true matches with linkage weights below the cut-off being lost.
At the CHeReL we choose to have two cut-offs:
The pairs of records with linkage weights between the upper and lower cut-offs are checked by hand. This is called clerical review (see 3.4 for details).
We aim to adjust the upper and lower cut-offs so that there are:
Where a linkage project involves records from the MLK, information is collected on whether false positive links relate to records already included in the Master Linkage Key or to the new records being linked to the MLK.
The record linkage software that is used by the CHeReL is ChoiceMaker. ChoiceMaker converts linkage weights to probabilities in the range of 0 to 1, with 0 representing a definite non-match and 1 representing a definite match.
The procedure for quality assurance in linkage projects is as follows:
We start each linkage by setting default cut-offs as follows:
Upper cut-off: p = 0.75
Lower cut-off: p = 0.25
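As a minimal sketch of how two cut-offs divide record pairs into three groups, the function below sorts a match probability into a match, a non-match, or the clerical review area. The names classify_pair, UPPER_CUTOFF and LOWER_CUTOFF are hypothetical and are not part of ChoiceMaker. How a probability exactly equal to a cut-off is treated is not stated here; the sketch assumes it goes to clerical review.

```python
UPPER_CUTOFF = 0.75  # default upper cut-off
LOWER_CUTOFF = 0.25  # default lower cut-off


def classify_pair(probability: float,
                  upper: float = UPPER_CUTOFF,
                  lower: float = LOWER_CUTOFF) -> str:
    """Sort a match probability into one of three groups."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability > upper:
        return "match"
    if probability < lower:
        return "non-match"
    # Assumption: values on or between the cut-offs are checked by hand.
    return "clerical review"


for p in (0.95, 0.50, 0.10):
    print(p, classify_pair(p))
```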
The aim of adjusting the upper cut-off is to minimise the number of false positive matches that lie above the upper cut-off.
A random sample of 1,000 groups of matched records with probabilities above the upper cut-off is reviewed by hand. If the false positive rate is above 5/1,000, the upper cut-off is raised to force these matches into the clerical review area. If there are no false positives, the upper cut-off is lowered to try to reduce the burden of clerical review. Once a new cut-off is selected, the linkage is run again and a new random sample of 1,000 groups of matched records above the upper cut-off is reviewed by hand. The process is repeated until the false positive rate is below 5 per 1,000.
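The decision made after each sample review can be sketched as follows. The function name and the returned labels are illustrative. The size of each adjustment is not specified, so the sketch only reports the direction of the change. It also assumes that a sample with some false positives, but no more than 5 per 1,000, leaves the cut-off unchanged.

```python
def review_upper_cutoff(false_positives: int,
                        sample_size: int = 1000,
                        target_per_1000: float = 5.0) -> str:
    """Return the direction in which to move the upper cut-off."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    # Compare as products to avoid floating-point division artefacts.
    if false_positives * 1000 > target_per_1000 * sample_size:
        return "raise"  # too many false positives above the cut-off
    if false_positives == 0:
        return "lower"  # room to reduce the clerical review burden
    return "keep"       # assumption: within target, so no change


print(review_upper_cutoff(7))  # raise
print(review_upper_cutoff(0))  # lower
print(review_upper_cutoff(3))  # keep
```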
The aim of adjusting the lower cut-off is to minimise the number of true matches that lie below the lower cut-off, because these matches will be lost. We refer to true links that are lost as 'false negative' links.
We review groups of records with probabilities that are close to the lower cut-off. If there are no true matches, we raise the lower cut-off to reduce the burden of clerical review. If there are true matches close to the lower cut-off, we lower the cut-off to try to pick up any true matches that might lie below it. Once a new lower cut-off is selected, the linkage is repeated and groups of records with probabilities close to the lower cut-off are reviewed again. The process is repeated until the false negative rate is below 5 per 1,000.
Groups of linked records with probabilities that lie between the upper and lower cut-offs are reviewed by the CHeReL Record Linkage Officers (RLOs). The RLO compares the records in each group across the full range of available information including first name, surname, date of birth, sex, and address, and decides which records in the group are matches and should stay together.
Once clerical review of uncertain matches is complete, a further review is carried out on a random sample of 5% of the groups of records reviewed by each RLO. This checking is carried out by either one of the database managers or an experienced RLO. If there are clerical review errors in more than 2.5% of the sampled groups, all of that RLO's clerical review work for the project is checked.
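The sampling and threshold check described above can be sketched as follows. The function names, the rounding of the sample size and the use of a seed are all assumptions made for illustration and do not describe the CHeReL's actual tooling.

```python
import random


def draw_audit_sample(reviewed_group_ids, fraction=0.05, seed=None):
    """Draw a random sample of one RLO's reviewed groups for checking."""
    ids = list(reviewed_group_ids)
    if not ids:
        return []
    # Assumption: round to the nearest whole group, with at least one.
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def needs_full_recheck(errors_found, sample_size, threshold=0.025):
    """True if more than 2.5% of the sampled groups contain errors."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return errors_found / sample_size > threshold


sample = draw_audit_sample(range(200), seed=1)
print(len(sample))                # 10
print(needs_full_recheck(3, 100)) # True
print(needs_full_recheck(2, 100)) # False
```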
When a new batch of data is added to the Master Linkage Key, the CHeReL follows the same procedure that is used for record linkage projects. These procedures are designed to ensure that the addition of new records results in fewer than 5/1,000 false positives and fewer than 5/1,000 false negatives where full identifiers are available.
Once a year the CHeReL carries out a comprehensive quality assurance exercise on the Master Linkage Key, with the aim of detecting and correcting false positive and false negative links. The specific methods that are used vary from year to year. A report from the most recent year is available for download.