Statistical Methods and Designs for Addressing Correlated Errors in Outcomes and Covariates in Studies Using Electronic Health Records Data [Methods Study], Tennessee, 2016-2021 (ICPSR 39726)
Version Date: Mar 12, 2026
Principal Investigator(s):
Bryan E. Shepherd, Vanderbilt University Medical Center
https://doi.org/10.3886/ICPSR39726.v1
Version V1
Summary
Electronic health records, or EHRs, have data on patient traits, health problems, and treatments. Researchers can use EHR data to study how treatments work or which patient traits affect health outcomes. But EHR data can have errors.
The best way to get accurate EHR data is to closely review patients' original records. But reviewing all patient records isn't possible when many patients are in a study. In such cases, researchers can review and correct records for a few patients and use the revised records to adjust data for all patients. But existing methods for using revised records don't address some kinds of errors, such as errors that are related. For example, errors in a treatment starting date can lead to mistakes in the data on length of treatment.
In this project, the research team created and tested new methods to improve the accuracy of EHR data. The new methods corrected records from some patients. Then the team used the corrections to address related errors for all patients.
To access the methods and software, please visit the MeasurementErrorMethods GitHub repository.
Study Purpose
- Develop novel statistical methods that reduce or eliminate bias caused by correlated errors in time-to-event outcomes and covariates, thereby addressing an important setting for which there is a lack of available methods
- Design optimal multiwave validation strategies, where one divides the validation sample into multiple waves and decides which records to validate in later sampling waves based on results learned from earlier sampling waves
- Apply the methods and designs to a study investigating the association between maternal weight gain during pregnancy and childhood health outcomes using EHR data
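The multiwave idea in the second aim can be sketched as a two-wave design. This is only an illustration under stated assumptions, not the study's actual design: the three strata, their sizes, the pilot values, and the helper name neyman_allocation are all hypothetical, and Neyman allocation (sample size proportional to stratum size times within-stratum standard deviation) is used as one well-known allocation rule. Wave 1 validates a small equal-sized pilot in each stratum; wave 2 spends the remaining budget where the pilot suggests the most variability.

```python
import numpy as np

def neyman_allocation(stratum_sizes, stratum_sds, n_total):
    """Split n_total validation slots across strata in proportion to N_h * S_h."""
    sizes = np.asarray(stratum_sizes, dtype=float)
    sds = np.asarray(stratum_sds, dtype=float)
    shares = sizes * sds
    raw = n_total * shares / shares.sum()
    alloc = np.floor(raw).astype(int)
    # Give the slots lost to rounding to the largest fractional remainders.
    leftover = n_total - alloc.sum()
    order = np.argsort(-(raw - alloc))
    alloc[order[:leftover]] += 1
    return alloc

rng = np.random.default_rng(42)

# Hypothetical cohort split into three strata (sizes are made up).
stratum_sizes = [5000, 3000, 2000]

# Wave 1: validate 50 records per stratum. The simulated values stand in for
# whatever per-record quantity the analyst wants to estimate precisely.
true_sds = [0.5, 1.0, 2.0]
wave1 = [rng.normal(scale=s, size=50) for s in true_sds]
estimated_sds = [np.std(v, ddof=1) for v in wave1]

# Wave 2: allocate the remaining 450 validations using the wave-1 estimates.
wave2_alloc = neyman_allocation(stratum_sizes, estimated_sds, n_total=450)
print(dict(zip(["stratum_0", "stratum_1", "stratum_2"], wave2_alloc.tolist())))
```

A fuller design would also cap each allocation at the number of unvalidated records left in the stratum and could repeat the update over more than two waves.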
Study Design
The research team developed methods to address correlated errors in statistical analyses of EHR data. To do this, the team first manually validated data for a subsample of patients. The team then developed four new methods that use the validated data to reduce bias caused by correlated errors:
- Multiple imputation (MI)
- Regression calibration (RC)
- Generalized raking (Raking)
- Sieve maximum likelihood estimation (SMLE)
Then the team conducted simulations to compare the robustness and efficiency of the four methods and created open-source software for all four methods.
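To give a sense of how a validated subsample can correct an analysis of the full data, here is a minimal sketch of textbook regression calibration on simulated data. It is deliberately simpler than the setting the study targets: only the covariate is measured with error, the outcome is continuous and error-free, and all variable names, sample sizes, and error variances are assumptions made for illustration. It should not be read as the study's RC method, which addresses correlated errors in both outcomes and covariates.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated cohort: x is the true covariate, x_star its error-prone EHR version.
n, n_valid = 20_000, 2_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)   # true slope is 2
x_star = x + rng.normal(size=n)          # classical measurement error

# Records whose true x was recovered by (simulated) chart review.
valid = rng.choice(n, size=n_valid, replace=False)

def ols(predictor, response):
    """Least-squares fit of response on an intercept plus one predictor."""
    design = np.column_stack([np.ones(len(response)), predictor])
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coef

# Naive analysis: use the error-prone covariate as if it were correct.
naive_slope = ols(x_star, y)[1]

# Step 1: calibration model for the true value given the error-prone value,
# fitted only on the validated records.
gamma = ols(x_star[valid], x[valid])

# Step 2: replace x_star by its predicted true value for every record.
x_hat = gamma[0] + gamma[1] * x_star

# Step 3: fit the outcome model on the calibrated covariate.
rc_slope = ols(x_hat, y)[1]

print(f"naive slope: {naive_slope:.2f}, calibrated slope: {rc_slope:.2f}")
```

In this setup the naive slope is pulled toward zero, while the calibrated slope should land near the true value of 2. Standard errors for the calibrated estimate need to account for the estimated calibration model, for example by bootstrapping both steps.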
Based on the simulation results, the research team then applied the Raking method to real patient data from 10,335 mother-child pairs, 996 of which were validated by chart review, to examine whether weight gain during pregnancy predicted risk of childhood obesity.
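The general idea behind raking can be sketched on simulated data. This is an illustration under stated assumptions, not the study's software: the cohort is simulated, the outcome is continuous rather than the outcome analyzed in the study, the validated records are a simple random sample, and the raw error-prone variables serve as the auxiliary variables. The helper names rake_weights and wls_slope are hypothetical. The sketch adjusts the sampling weights of the validated records so that their weighted totals of the error-prone variables match the totals in the full cohort, then fits a weighted model to the validated values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated cohort with errors in covariate and outcome that share a component.
n, n_valid = 10_000, 1_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)               # true slope is 2
shared = rng.normal(size=n)                          # source of correlated error
x_star = x + shared + 0.5 * rng.normal(size=n)
y_star = y + 0.8 * shared + 0.5 * rng.normal(size=n)

# Simple random validation subsample; design weight = cohort size / sample size.
valid = rng.choice(n, size=n_valid, replace=False)
design_w = np.full(n_valid, n / n_valid)

# Auxiliary variables are observed for everyone, so their totals are known.
aux_all = np.column_stack([np.ones(n), x_star, y_star])
targets = aux_all.sum(axis=0)
aux_valid = aux_all[valid]

def rake_weights(d, aux, totals, n_iter=50, tol=1e-10):
    """Newton iterations for weights d * exp(aux @ lam) that hit the totals."""
    lam = np.zeros(aux.shape[1])
    for _ in range(n_iter):
        w = d * np.exp(aux @ lam)
        gap = aux.T @ w - totals
        if np.max(np.abs(gap)) < tol * np.max(np.abs(totals)):
            break
        jac = aux.T @ (aux * w[:, None])
        lam -= np.linalg.solve(jac, gap)
    return d * np.exp(aux @ lam)

def wls_slope(predictor, response, weights):
    """Slope from weighted least squares with an intercept."""
    design = np.column_stack([np.ones(len(response)), predictor])
    xtw = design.T * weights
    return np.linalg.solve(xtw @ design, xtw @ response)[1]

raked_w = rake_weights(design_w, aux_valid, targets)

naive_slope = wls_slope(x_star, y_star, np.ones(n))     # error-prone data only
raked_slope = wls_slope(x[valid], y[valid], raked_w)    # validated data, raked

print(f"naive slope: {naive_slope:.2f}, raked slope: {raked_slope:.2f}")
```

Because the final model is fitted only to validated values, the estimate does not rely on a model for the error structure; the error-prone data contribute through the calibrated weights, which is where any gain in precision comes from.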
Patients, caregivers, and doctors gave input throughout the study.
Data Source
Simulated data; EHR data from Vanderbilt University Medical Center on 10,335 mother-child pairs
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats so that researchers can access data and documentation in the formats that best suit their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.
