Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021 (ICPSR 39724)

Version Date: Mar 23, 2026

Principal Investigator(s):
Rebecca A. Hubbard, University of Pennsylvania

https://doi.org/10.3886/ICPSR39724.v1

Version V1

Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person's known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients.

Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It's even harder when data are missing due to a patient's health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes.
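The effect described above can be illustrated with a small simulation. This is a minimal sketch, not the research team's method: the function name, the testing probabilities, and the rule that a patient is flagged only when a lab test is on record are all assumptions chosen for illustration.

```python
import random


def simulate_lab_only_detection(n_patients=10_000,
                                p_test_uncontrolled=0.9,
                                p_test_controlled=0.4,
                                seed=0):
    """Share of diabetes patients flagged when only lab tests are used.

    Assumption: every simulated patient has diabetes, and a patient is
    flagged only if at least one lab test appears in the record.
    The testing probabilities are hypothetical, not taken from the study.
    """
    rng = random.Random(seed)
    flagged = {"controlled": 0, "uncontrolled": 0}
    totals = {"controlled": 0, "uncontrolled": 0}
    for _ in range(n_patients):
        # Half of the simulated patients have uncontrolled diabetes.
        status = "uncontrolled" if rng.random() < 0.5 else "controlled"
        p_test = (p_test_uncontrolled if status == "uncontrolled"
                  else p_test_controlled)
        totals[status] += 1
        # Whether a lab test is recorded depends on health status.
        if rng.random() < p_test:
            flagged[status] += 1
    return {s: flagged[s] / totals[s] for s in totals}


rates = simulate_lab_only_detection()
print(rates)
```

Under these assumed probabilities, a smaller share of patients with controlled diabetes is flagged, simply because fewer of them have a lab test on record.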

In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes.

To access the methods and software, please visit the bias_correction GitHub repository.

Hubbard, Rebecca A. Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021. Inter-university Consortium for Political and Social Research [distributor], 2026-03-23. https://doi.org/10.3886/ICPSR39724.v1

Funding: Patient-Centered Outcomes Research Institute (PCORI) (ME-1511-32666)
Distributor: Inter-university Consortium for Political and Social Research

Time Period: 2016 -- 2021

To develop and evaluate a latent class model for estimating phenotypes using EHR data

The research team developed a Bayesian latent class model for predicting a patient's phenotype. The model combined information on data availability and observed data values for each patient to estimate a latent, or unobserved, phenotype. The model assumed that the latent phenotype was correlated with model covariates, like biomarkers, clinical diagnosis codes, prescription medications, age, and gender.
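The idea of combining data availability with observed values can be sketched with Bayes' rule. The example below is a simplified illustration, not the research team's model: it uses fixed, hypothetical parameters instead of estimating them, a single biomarker and a single diagnosis code, and it assumes these are independent given the latent phenotype. The function names and all numeric values are assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_phenotype_prob(biomarker, has_code, prior=0.2):
    """Posterior probability that the latent phenotype is present.

    biomarker: observed value, or None if the biomarker is missing.
    has_code:  True if a clinical diagnosis code is on record.
    prior:     assumed phenotype prevalence (hypothetical).
    """
    # Hypothetical class-specific parameters (1 = phenotype present).
    params = {
        1: {"p_measured": 0.8, "mean": 7.5, "sd": 1.0, "p_code": 0.70},
        0: {"p_measured": 0.4, "mean": 5.5, "sd": 0.5, "p_code": 0.05},
    }
    likelihood = {}
    for phenotype, p in params.items():
        # Contribution of the diagnosis code.
        lik = p["p_code"] if has_code else 1 - p["p_code"]
        if biomarker is None:
            # Data availability alone carries information.
            lik *= 1 - p["p_measured"]
        else:
            # Availability plus the observed value.
            lik *= p["p_measured"] * normal_pdf(biomarker, p["mean"], p["sd"])
        likelihood[phenotype] = lik
    numerator = prior * likelihood[1]
    return numerator / (numerator + (1 - prior) * likelihood[0])


print(posterior_phenotype_prob(8.0, True))    # high biomarker, code present
print(posterior_phenotype_prob(None, True))   # biomarker missing, code present
print(posterior_phenotype_prob(None, False))  # biomarker missing, no code
```

In this sketch a missing biomarker still shifts the estimate, because the assumed probability of being measured differs between the two latent classes. A full Bayesian treatment would place prior distributions on these parameters and estimate them from the data.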

The research team then simulated EHR data for 1,000 patients to resemble a sample of patients at high risk for type 2 diabetes. They introduced two patterns of missing data in the biomarkers: missing at random, where missingness depends only on observed data, and missing not at random, where missingness depends on unobserved values. The team evaluated the model using the simulated data. They compared the model's performance with existing phenotype estimation methods based on (1) biomarkers only, (2) clinical codes only, (3) biomarkers and clinical codes, (4) biomarkers with missing values replaced via multiple imputation (MI), and (5) biomarkers and clinical codes with missing biomarker values replaced via MI. For each method, the team calculated sensitivity, specificity, and the proportion of patients misclassified relative to an actual type 2 diabetes diagnosis.
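A rough sketch of this kind of evaluation is shown below. It is not the research team's simulation: the prevalence, biomarker distributions, missingness probabilities, the 6.5 cutoff, and the function names are all assumptions for illustration. It covers only a biomarkers-only rule that treats a missing value as "no diabetes".

```python
import random


def classification_metrics(truth, predicted):
    """Sensitivity, specificity, and proportion misclassified.

    Assumes both classes occur in `truth`.
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "misclassified": (fp + fn) / len(truth),
    }


def mask_biomarker(values, ages, mechanism, rng):
    """Return a copy of `values` with some entries replaced by None."""
    masked = []
    for value, age in zip(values, ages):
        if mechanism == "MAR":
            # Missingness depends on an observed covariate (age).
            p_missing = 0.6 if age < 50 else 0.2
        elif mechanism == "MNAR":
            # Missingness depends on the biomarker value itself.
            p_missing = 0.6 if value < 6.5 else 0.2
        else:
            raise ValueError("mechanism must be 'MAR' or 'MNAR'")
        masked.append(None if rng.random() < p_missing else value)
    return masked


def run_demo(mechanism, n_patients=1000, seed=0):
    """Simulate patients and score a biomarkers-only rule."""
    rng = random.Random(seed)
    truth, values, ages = [], [], []
    for _ in range(n_patients):
        has_diabetes = rng.random() < 0.3  # assumed prevalence
        truth.append(has_diabetes)
        # Hypothetical biomarker distributions for each group.
        values.append(rng.gauss(7.5, 1.0) if has_diabetes
                      else rng.gauss(5.5, 0.5))
        ages.append(rng.uniform(30, 80))
    observed = mask_biomarker(values, ages, mechanism, rng)
    # Missing biomarker values are classified as "no diabetes".
    predicted = [v is not None and v >= 6.5 for v in observed]
    return classification_metrics(truth, predicted)


for mechanism in ("MAR", "MNAR"):
    print(mechanism, run_demo(mechanism))
```

Because the rule treats missing values as negative, every patient with diabetes whose biomarker is missing counts against sensitivity. The other comparison methods would replace only the `predicted` step.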

Clinicians, patients, and caregivers helped design the study.

Simulated data resembling a cohort of 1,000 patients at risk for type 2 diabetes

2026-03-23

Notes

  • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

  • ICPSR usually offers files in multiple formats so that researchers can access data and documentation in the formats that best suit their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.