Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021 (ICPSR 39724)

Name: Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021
Published: 2026-03-23
License: https://www.icpsr.umich.edu/web/ICPSR/studies/39724/terms

Version Date: Mar 23, 2026

Principal Investigator(s):
Rebecca A. Hubbard, University of Pennsylvania

https://doi.org/10.3886/ICPSR39724.v1

Version V1

Summary

Researchers can use data from electronic health records, or EHRs, in studies that compare two or more treatments. In these studies, researchers need to identify all patients with the same phenotype. Phenotypes are a person's known traits, like height and weight, or known health problems, like diabetes. However, in EHR data, some data on patient traits or health problems may be missing for some patients.

Missing data in EHRs make it hard to correctly identify all patients with the same phenotype. It's even harder when data are missing due to a patient's health status. For example, patients with uncontrolled diabetes may need more lab tests than patients with controlled diabetes. As a result, researchers who are looking at lab tests may not identify patients with controlled diabetes as having diabetes.

In this project, the research team developed and tested a new statistical method that accounts for missing EHR data to estimate patient phenotypes.

To access the methods and software, please visit the bias_correction GitHub repository.

Citation

Hubbard, Rebecca A. Statistical Methods for Phenotype Estimation and Analysis Using Electronic Health Records [Methods Study], 2016-2021. Inter-university Consortium for Political and Social Research [distributor], 2026-03-23. https://doi.org/10.3886/ICPSR39724.v1

Funding

Patient-Centered Outcomes Research Institute (PCORI) (ME-1511-32666)

Subject Terms

databases; diabetes; medical records; statistical models; treatment outcome

Distributor(s)

Inter-university Consortium for Political and Social Research

Time Period(s)

2016 -- 2021

Study Purpose

To develop and evaluate a latent class model for estimating phenotypes using EHR data

Study Design

The research team developed a Bayesian latent class model for predicting a patient's phenotype. The model combined information on data availability and observed data values for each patient to estimate a latent, or unobserved, phenotype. The model assumed that the latent phenotype was correlated with model covariates, like biomarkers, clinical diagnosis codes, prescription medications, age, and gender.
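As a rough illustration of how data availability and observed values can jointly inform a latent phenotype, the expression below gives the posterior class probability for a generic two-class latent class model. The notation (D_i, X_i, R_i, Y_i, \pi_i) is hypothetical and not taken from the study, and the form assumes that measurement and biomarker value are conditionally independent given the phenotype; the study's actual model specification may differ.

```latex
% Hypothetical notation, introduced here for illustration only:
%   D_i   latent phenotype of patient i (1 = has the condition, 0 = does not)
%   X_i   fully observed covariates (e.g., age, gender)
%   R_i   availability indicator (1 = biomarker was measured, 0 = missing)
%   Y_i   biomarker value, contributing only when R_i = 1
%   \pi_i prior probability of the phenotype given covariates
\[
\Pr(D_i = 1 \mid X_i, R_i, Y_i)
  = \frac{\pi_i \, p(R_i \mid D_i = 1)\, p(Y_i \mid D_i = 1)^{R_i}}
         {\sum_{d \in \{0,1\}} \pi_i^{\,d} (1 - \pi_i)^{1-d}\,
          p(R_i \mid D_i = d)\, p(Y_i \mid D_i = d)^{R_i}},
\qquad
\pi_i = \Pr(D_i = 1 \mid X_i).
\]
```

Under this form, a missing biomarker (R_i = 0) still shifts the posterior through p(R_i | D_i), which is how the availability of a measurement can carry information about the phenotype.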

The research team then simulated EHR data for 1,000 patients to resemble a sample of patients at high risk for type 2 diabetes. They introduced two patterns of missing data in the biomarkers: missing at random and missing not at random. The team evaluated the model using the simulated data. They compared the model's performance with existing phenotype estimation methods based on (1) biomarkers only, (2) clinical codes only, (3) biomarkers and clinical codes, (4) biomarkers with missing values replaced via multiple imputation (MI), and (5) biomarkers and clinical codes with missing biomarker values replaced via MI. For each method, the team calculated sensitivity, specificity, and the proportion of patients misclassified relative to an actual type 2 diabetes diagnosis.
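The evaluation described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's simulation code: the function names, the 30% prevalence, the HbA1c-like biomarker distributions, the measurement probabilities, and the 6.5 classification threshold are all assumptions chosen for the example. It simulates a biomarker under the two missingness patterns, applies a simple biomarker-only rule, and computes the three reported metrics.

```python
import numpy as np


def classification_metrics(truth, predicted):
    """Sensitivity, specificity, and proportion misclassified for binary labels."""
    truth = np.asarray(truth, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = np.sum(truth & predicted)
    tn = np.sum(~truth & ~predicted)
    fp = np.sum(~truth & predicted)
    fn = np.sum(truth & ~predicted)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "misclassified": (fp + fn) / truth.size,
    }


def simulate(n=1000, mechanism="MNAR", seed=0):
    """Simulate true phenotype and a partially observed biomarker (all values assumed)."""
    rng = np.random.default_rng(seed)
    age = rng.normal(55, 10, n)              # fully observed covariate
    diabetes = rng.random(n) < 0.3           # true phenotype; prevalence is an assumption
    biomarker = np.where(
        diabetes, rng.normal(7.5, 1.0, n), rng.normal(5.5, 0.5, n)
    )
    if mechanism == "MAR":
        # Missing at random: measurement depends only on the observed covariate.
        p_measured = 1.0 / (1.0 + np.exp(-(age - 55) / 10))
    elif mechanism == "MNAR":
        # Missing not at random: measurement depends on the unobserved phenotype.
        p_measured = np.where(diabetes, 0.9, 0.4)
    else:
        raise ValueError("mechanism must be 'MAR' or 'MNAR'")
    measured = rng.random(n) < p_measured
    observed = np.where(measured, biomarker, np.nan)
    return diabetes, observed


def biomarker_only_rule(observed, threshold=6.5):
    """Classify as positive only when a measured value meets the threshold."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.zeros(observed.shape, dtype=bool)
    measured = ~np.isnan(observed)
    predicted[measured] = observed[measured] >= threshold
    return predicted


if __name__ == "__main__":
    for mechanism in ("MAR", "MNAR"):
        truth, observed = simulate(mechanism=mechanism)
        print(mechanism, classification_metrics(truth, biomarker_only_rule(observed)))
```

Because the rule treats every unmeasured patient as negative, its sensitivity depends directly on which patients go unmeasured, which is the kind of weakness a method that models data availability is meant to address.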

Clinicians, patients, and caregivers helped design the study.

Data Source

Simulated data resembling a cohort of 1,000 patients at risk for type 2 diabetes

Original Release Date

2026-03-23

Notes

The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats so that researchers can access data and documentation in the format that best suits their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.

This study is maintained and distributed by the Patient-Centered Outcomes Data Repository (PCODR). PCODR is the official data repository of the Patient-Centered Outcomes Research Institute (PCORI).
