Statistical Methods for Missing Data in Large Observational Studies [Methods Study], Georgia, 2013-2018 (ICPSR 39526)
Version Date: Oct 27, 2025
Principal Investigator(s):
Qi Long, Emory University
https://doi.org/10.3886/ICPSR39526.v1
Version V1
Summary
Health registries record data about patients with a specific health problem. These data may include age, weight, blood pressure, health problems, medical test results, and treatments received. But data in some patient records may be missing. For example, some patients may not report their weight or all of their health problems.
Research studies can use data from health registries to learn how well treatments work. But missing data can lead to incorrect results. To address the problem, researchers often exclude patient records with missing data from their studies. But doing this can also lead to incorrect results. The fewer records that researchers use, the greater the chance for incorrect results.
Missing data also lead to another problem: it is harder for researchers to find patient traits that could affect diagnosis and treatment. For example, patients who are overweight may get heart disease. But if weight data are missing, it is hard for researchers to be sure that this trait affects the diagnosis and treatment of heart disease.
In this study, the research team developed new statistical methods to fill in missing data in large studies. The team also developed methods to use when data are missing to help find patient traits that could affect diagnosis and treatment.
To access the methods, software, and R package, please visit the Long Research Group website.
Study Purpose
This study aimed to develop imputation methods and variable selection methods for large observational studies with missing data; develop software packages in R that implement the proposed methods; assess the proposed methods via simulation studies and data analysis; and educate and train stakeholders and graduate students on statistical methods for missing data.
Study Design
Missing data in large observational data sets can compromise analyses and hinder identification of important predictors of patient outcomes. Traditional methods for handling missing data, such as available-case or complete-case analysis, exclude patients with missing data from the analysis, which can bias results. Methodologists have developed statistical methods for substituting, or "imputing," missing values, but these methods are best suited to small data sets. Further, limited research exists on variable selection approaches to use with imputed data.
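The bias from complete-case analysis can be illustrated with a small simulation. This is only a sketch on made-up data, not registry data: the function names, the weight distribution, and the missingness rule (heavier patients report their weight less often) are all assumptions chosen for illustration.

```python
import random


def complete_case_mean(values):
    """Mean of the recorded values only; None marks a missing value."""
    observed = [v for v in values if v is not None]
    return sum(observed) / len(observed)


def simulate_complete_case_bias(n, seed=0):
    """Return (mean of all true values, mean of the complete cases).

    Hypothetical setup: weights are normal(80, 15), and a weight above 80
    is recorded with probability 0.3, otherwise with probability 0.9.
    """
    rng = random.Random(seed)
    true_weights = [rng.gauss(80.0, 15.0) for _ in range(n)]
    recorded = []
    for w in true_weights:
        p_recorded = 0.3 if w > 80.0 else 0.9
        recorded.append(w if rng.random() < p_recorded else None)
    full_mean = sum(true_weights) / n
    return full_mean, complete_case_mean(recorded)


full_mean, cc_mean = simulate_complete_case_bias(10_000)
print(f"mean of all true weights:   {full_mean:.1f}")
print(f"mean of complete cases only: {cc_mean:.1f}")
```

Because missingness here depends on the value itself, dropping incomplete records systematically underestimates the average weight, no matter how many records are collected.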
The research team developed novel imputation and variable selection methods, and accompanying software, for handling missing data in large observational studies that include high-dimensional data, or data in which the number of variables may exceed the number of complete cases.
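A rough calculation shows why complete cases become scarce as the number of variables grows. The sketch below assumes, purely for illustration, that every variable is missing independently at the same rate; real registries rarely behave this simply, and the function name and the 5% rate are hypothetical.

```python
def expected_complete_fraction(missing_rate, n_variables):
    """Expected share of records with no missing value.

    Assumes each variable is missing independently with the same
    probability, which is a simplification made only for this example.
    """
    return (1.0 - missing_rate) ** n_variables


# With a 5% missing rate per variable, the complete-case share
# shrinks geometrically as variables are added.
for p in (5, 20, 50, 100):
    share = expected_complete_fraction(0.05, p)
    print(f"{p:>3} variables: {share:.3f} of records expected to be complete")
```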
To construct imputation models, the team incorporated Bayesian lasso regression methods and direct and indirect application of regularized regression. To avoid over-tailoring the models to each of the many missing-data patterns in the high-dimensional data, researchers extended these methods to multivariate imputation by chained equations (MICE) approaches. In addition, researchers examined novel variable selection approaches that combined bootstrap imputation with either stability selection or bootstrap lasso.
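The general idea of using regularized regression inside a chained-equations loop can be sketched as follows. This is not the study's software or its R package: the function names, the penalty value, and the toy data are assumptions, and the sketch fills in deterministic lasso predictions rather than drawing from a predictive distribution, so it yields a single imputation, not the multiple imputations that MICE or Bayesian lasso would produce.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*sum((y - b0 - X b)^2) + lam*sum(|b|)."""
    n, p = len(X), len(X[0])
    b0, beta = 0.0, [0.0] * p
    resid = list(y)  # current residuals y - b0 - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        shift = sum(resid) / n  # update the unpenalized intercept
        b0 += shift
        resid = [r - shift for r in resid]
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid = [resid[i] - X[i][j] * delta for i in range(n)]
                beta[j] = new
    return b0, beta


def chained_lasso_impute(data, lam=0.05, n_cycles=5):
    """Fill None entries column by column with lasso predictions.

    Assumes every column has at least one observed value.
    """
    n, p = len(data), len(data[0])
    missing = [[data[i][j] is None for j in range(p)] for i in range(n)]
    filled = [row[:] for row in data]
    for j in range(p):  # start from observed column means
        obs = [data[i][j] for i in range(n) if not missing[i][j]]
        mean_j = sum(obs) / len(obs)
        for i in range(n):
            if missing[i][j]:
                filled[i][j] = mean_j
    for _ in range(n_cycles):
        for j in range(p):
            miss_rows = [i for i in range(n) if missing[i][j]]
            if not miss_rows:
                continue
            obs_rows = [i for i in range(n) if not missing[i][j]]
            others = [k for k in range(p) if k != j]
            X_obs = [[filled[i][k] for k in others] for i in obs_rows]
            y_obs = [filled[i][j] for i in obs_rows]
            b0, beta = lasso_fit(X_obs, y_obs, lam)
            for i in miss_rows:
                filled[i][j] = b0 + sum(b * filled[i][k] for b, k in zip(beta, others))
    return filled


# Toy data (made up): column 1 depends strongly on column 0, column 2 is noise.
random.seed(1)
truth = []
for _ in range(200):
    a = random.gauss(0.0, 1.0)
    truth.append([a, 2.0 * a + 0.1 * random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)])
data = [row[:] for row in truth]
for i in range(0, 200, 5):
    data[i][1] = None  # hide some values of column 1
for i in range(1, 200, 5):
    data[i][0] = None  # hide some values of column 0
filled = chained_lasso_impute(data)
mae = sum(abs(filled[i][1] - truth[i][1]) for i in range(0, 200, 5)) / 40
print(f"mean absolute imputation error for column 1: {mae:.3f}")
```

The L1 penalty shrinks coefficients of uninformative predictors toward zero, which is what makes this kind of imputation model usable when there are many candidate predictors. A proper multiple-imputation procedure would add random draws at the prediction step and repeat the whole process to produce several completed data sets.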
Through simulations using data sets created to mimic large observational studies, researchers evaluated the methods. They validated the methods empirically using data from the Georgia Coverdell Acute Stroke Registry (GCASR), which contains missing data for 60% of registry data elements. The team applied their procedures in modeling factors associated with the time between patients arriving at the hospital and receiving brain imaging (n=86,322) and with length of hospital stay (n=1,807) for acute stroke patients.
Advisers from the Georgia Department of Public Health, GCASR, and the Centers for Disease Control and Prevention met with researchers and provided feedback on study design and data analyses.
Data Source
Data come from the Georgia Coverdell Acute Stroke Registry (GCASR) and cover 86,322 clinically diagnosed acute stroke hospital admissions in Georgia between 2005 and 2013.
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
