Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021 (ICPSR 39724)
Version Date: Mar 23, 2026
Principal Investigator(s):
Rebecca A. Hubbard, University of Pennsylvania
https://doi.org/10.3886/ICPSR39724.v1
Version V1
Summary
Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person's known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients.
Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It's even harder when data are missing due to a patient's health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes.
In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes.
To access the methods and software, please visit the bias_correction GitHub repository.
Study Purpose
To develop and evaluate a latent class model for estimating phenotypes using EHR data
Study Design
The research team developed a Bayesian latent class model for predicting a patient's phenotype. The model combined information on data availability and observed data values for each patient to estimate a latent, or unobserved, phenotype. The model assumed that the latent phenotype was correlated with model covariates, like biomarkers, clinical diagnosis codes, prescription medications, age, and gender.
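As a rough illustration of how data availability and observed values can be combined, the sketch below computes the posterior probability of a latent phenotype for one patient in a two-class model. It is not the study's implementation: the parameter values, the single biomarker, the single diagnosis-code indicator, and the function names are all hypothetical, and the parameters are held fixed rather than given priors and estimated as a full Bayesian fit would do.

```python
import math

# Hypothetical, fixed parameter values for illustration only. A full Bayesian
# fit would place priors on these and estimate them from the data.
PARAMS = {
    "prior": 0.2,                    # P(D = 1), prevalence of the phenotype
    "p_measured": {0: 0.3, 1: 0.8},  # P(biomarker was measured | D)
    "mean": {0: 5.4, 1: 7.5},        # biomarker mean given D
    "sd": {0: 0.5, 1: 1.2},          # biomarker standard deviation given D
    "p_code": {0: 0.02, 1: 0.6},     # P(diagnosis code present | D)
}


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def class_likelihood(d, biomarker, has_code, params=PARAMS):
    """Likelihood of one patient's record given latent class d (0 or 1).

    biomarker is a float, or None when the biomarker was never measured.
    Whether it was measured contributes to the likelihood on its own.
    """
    measured = biomarker is not None
    p_meas = params["p_measured"][d]
    lik = p_meas if measured else 1.0 - p_meas
    if measured:
        lik *= normal_pdf(biomarker, params["mean"][d], params["sd"][d])
    p_code = params["p_code"][d]
    lik *= p_code if has_code else 1.0 - p_code
    return lik


def posterior_phenotype(biomarker, has_code, params=PARAMS):
    """P(D = 1 | record) by Bayes' rule over the two latent classes."""
    prior = params["prior"]
    num = prior * class_likelihood(1, biomarker, has_code, params)
    den = num + (1.0 - prior) * class_likelihood(0, biomarker, has_code, params)
    return num / den
```

For example, `posterior_phenotype(None, True)` returns a probability for a patient with a diagnosis code but no biomarker measurement. Under these assumed parameters, the absence of a measurement lowers the probability but does not rule the phenotype out.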
The research team then simulated EHR data for 1,000 patients to resemble a sample of patients at high risk for type 2 diabetes. They introduced two patterns of missing data in the biomarkers: missing at random and missing not at random. The team evaluated the model using the simulated data. They compared the model's performance with existing phenotype estimation methods based on (1) biomarkers only, (2) clinical codes only, (3) biomarkers and clinical codes, (4) biomarkers with missing values replaced via multiple imputation (MI), and (5) biomarkers and clinical codes with missing biomarker values replaced via MI. For each method, the team calculated sensitivity, specificity, and the proportion of patients misclassified relative to an actual type 2 diabetes diagnosis.
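The evaluation described above can be sketched in miniature as follows. This is a simplified illustration under stated assumptions, not the study's simulation: the prevalence, biomarker distributions, testing probabilities, the 6.5 threshold, and the function names are all hypothetical. It covers only a biomarker-only classification rule, with age as the observed covariate that drives missingness in the missing-at-random case.

```python
import random


def simulate_cohort(n=1000, prevalence=0.2, mechanism="MNAR", seed=0):
    """Simulate a list of (has_diabetes, age, biomarker or None) records.

    All numeric settings here are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    cohort = []
    for _ in range(n):
        diabetic = rng.random() < prevalence
        age = rng.randint(30, 80)
        value = rng.gauss(7.5, 1.2) if diabetic else rng.gauss(5.4, 0.5)
        if mechanism == "MAR":
            # Missing at random: the chance of being tested depends only on
            # an observed covariate (age).
            p_observed = 0.7 if age >= 60 else 0.4
        elif mechanism == "MNAR":
            # Missing not at random: the chance of being tested depends on
            # the biomarker value itself, which is unseen when missing.
            p_observed = 0.8 if value >= 6.5 else 0.3
        else:
            raise ValueError("mechanism must be 'MAR' or 'MNAR'")
        observed = rng.random() < p_observed
        cohort.append((diabetic, age, value if observed else None))
    return cohort


def ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluate_biomarker_only(cohort, threshold=6.5):
    """Classify as positive only if a measured biomarker meets the threshold.

    Patients with no measurement are classified as negative.
    """
    tp = fp = tn = fn = 0
    for diabetic, _age, value in cohort:
        predicted = value is not None and value >= threshold
        if diabetic and predicted:
            tp += 1
        elif diabetic:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "misclassified": ratio(fp + fn, len(cohort)),
    }
```

Calling `evaluate_biomarker_only(simulate_cohort(mechanism="MAR"))` and the same with `mechanism="MNAR"` gives the three metrics for each missingness pattern, so the two can be compared side by side.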
Clinicians, patients, and caregivers helped design the study.
Data Source
Simulated data resembling a cohort of 1,000 patients at risk for type 2 diabetes
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats so that researchers can access data and documentation in the format that best suits their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.

This study is maintained and distributed by the Patient-Centered Outcomes Data Repository (PCODR). PCODR is the official data repository of the Patient-Centered Outcomes Research Initiative (PCORI).