Utilising Identifier Error Variation in Linkage of Large Administrative Data Sources

Overview

Journal BMC Med Res Methodol

Publisher BioMed Central

Specialties General Medicine
Health Services

Date 2017 Feb 9

PMID 28173759

Citations 12

Authors

Katie Harron

Gareth Hagger-Johnson

Ruth Gilbert

Harvey Goldstein

Abstract

Background: Linkage of administrative data sources often relies on probabilistic methods using a set of common identifiers (e.g. sex, date of birth, postcode). Variation in data quality on an individual or organisational level (e.g. by hospital) can result in clustering of identifier errors, violating the assumption of independence between identifiers required for traditional probabilistic match weight estimation. This potentially introduces selection bias to the resulting linked dataset. We aimed to measure variation in identifier error rates in a large English administrative data source (Hospital Episode Statistics; HES) and to incorporate this information into match weight calculation.
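As an illustration of the traditional approach referred to above, the following is a minimal sketch of match weights computed under the independence assumption, in the Fellegi-Sunter style. The function names and all m- and u-probabilities are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import math

def field_weight(m, u, agrees):
    """Log2 match weight for a single identifier.

    m: P(identifier agrees | records are a true match)
    u: P(identifier agrees | records are not a match)
    """
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def record_pair_weight(m_probs, u_probs, agreement):
    """Total weight for a record pair.

    Summing per-field weights is only valid if identifier errors are
    independent, which is the assumption that clustered errors violate.
    """
    return sum(
        field_weight(m_probs[f], u_probs[f], agreement[f]) for f in agreement
    )

# Hypothetical probabilities, for illustration only.
M = {"sex": 0.99, "dob": 0.98, "postcode": 0.90}
U = {"sex": 0.50, "dob": 0.001, "postcode": 0.0001}

# A candidate pair that agrees on sex and date of birth but not postcode.
pair = {"sex": True, "dob": True, "postcode": False}
total = record_pair_weight(M, U, pair)
print(f"total match weight: {total:.2f}")
```

If errors in two identifiers tend to occur together (for example, within one hospital), the summed weight misstates the evidence for a match, which is the problem the paper sets out to address.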

Methods: We used 30,000 randomly selected HES hospital admissions records of patients aged 0-1, 5-6 and 18-19 years, for 2011/2012, linked via NHS number with data from the Personal Demographic Service (PDS; our gold-standard). We calculated identifier error rates for sex, date of birth and postcode and used multi-level logistic regression to investigate associations with individual-level attributes (age, ethnicity, and gender) and organisational variation. We then derived: i) weights incorporating dependence between identifiers; ii) attribute-specific weights (varying by age, ethnicity and gender); and iii) organisation-specific weights (by hospital). Results were compared with traditional match weights using a simulation study.
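One way to picture attribute-specific weights is to estimate the agreement probability among true matches separately within each attribute level, using gold-standard linked pairs. The sketch below assumes that approach; the function names, record layout, toy counts and u-probability are all hypothetical, and this is not presented as the authors' implementation.

```python
import math
from collections import defaultdict

def stratified_m_probs(gold_pairs, identifier, attribute):
    """Estimate P(identifier agrees | true match) within each attribute level.

    gold_pairs: iterable of dicts holding the attribute value and a boolean
    field named '<identifier>_agrees' (a hypothetical layout).
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for rec in gold_pairs:
        level = rec[attribute]
        total[level] += 1
        agree[level] += int(rec[identifier + "_agrees"])
    return {level: agree[level] / total[level] for level in total}

def agreement_weight(m, u):
    """Log2 weight applied when the identifier agrees."""
    return math.log2(m / u)

# Toy gold-standard pairs; the counts are invented for illustration.
gold_pairs = (
    [{"age_group": "0-1", "postcode_agrees": True}] * 90
    + [{"age_group": "0-1", "postcode_agrees": False}] * 10
    + [{"age_group": "18-19", "postcode_agrees": True}] * 97
    + [{"age_group": "18-19", "postcode_agrees": False}] * 3
)

m_by_age = stratified_m_probs(gold_pairs, "postcode", "age_group")
u_postcode = 0.0001  # hypothetical chance-agreement probability
weights_by_age = {g: agreement_weight(m, u_postcode) for g, m in m_by_age.items()}
print(weights_by_age)
```

The same stratification could be applied by hospital to obtain organisation-specific weights, again under the assumption that enough gold-standard pairs exist in each stratum to estimate the probabilities reliably.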

Results: Identifier errors (where values disagreed in linked HES-PDS records) or missing values were found in 0.11% of records for sex and date of birth and in 53% of records for postcode. Identifier error rates differed significantly by age, ethnicity and sex (p < 0.0005). Errors were less frequent in males, in 5-6 year olds and 18-19 year olds compared with infants, and were lowest for the Asian ethnic group. A simulation study demonstrated that substantial bias was introduced into estimated readmission rates in the presence of identifier errors. Attribute- and organisation-specific weights reduced this bias compared with weights estimated using traditional probabilistic matching algorithms.

Conclusions: We provide empirical evidence on variation in rates of identifier error in a widely-used administrative data source and propose a new method for deriving match weights that incorporates additional data attributes. Our results demonstrate that incorporating information on variation by individual-level characteristics can help to reduce bias due to linkage error.

Citing Articles

Generating synthetic identifiers to support development and evaluation of data linkage methods.

Lam J, Boyd A, Linacre R, Blackburn R, Harron K Int J Popul Data Sci. 2024; 9(1):2389.

PMID: 39620124 PMC: 11606631. DOI: 10.23889/ijpds.v9i1.2389.

Microsimulation of an educational attainment register to predict future record linkage quality.

Schnell R, Weiand S Int J Popul Data Sci. 2023; 8(1):2122.

PMID: 37649490 PMC: 10463005. DOI: 10.23889/ijpds.v8i1.2122.

Virtual patient identifier (vPID): Improving patient traceability using anonymized identifiers in Japanese healthcare insurance claims database.

Sato J, Mitsutake N, Yamada H, Kitsuregawa M, Goda K Heliyon. 2023; 9(5):e16209.

PMID: 37234615 PMC: 10205637. DOI: 10.1016/j.heliyon.2023.e16209.

A framework for a consistent and reproducible evaluation of manual review for patient matching algorithms.

Gupta A, Kasthurirathne S, Xu H, Li X, Ruppert M, Harle C J Am Med Inform Assoc. 2022; 29(12):2105-2109.

PMID: 36305781 PMC: 9667171. DOI: 10.1093/jamia/ocac175.

Tucuxi-BLAST: Enabling fast and accurate record linkage of large-scale health-related administrative databases through a DNA-encoded approach.

Araujo J, Santos-E-Silva J, Costa-Martins A, Sampaio V, de Castro D, de Souza R PeerJ. 2022; 10:e13507.

PMID: 35846888 PMC: 9281601. DOI: 10.7717/peerj.13507.
