Using Physician Behavioral Big Data for High Precision Fraud Prediction and Detection, United States, 2000-2019 (ICPSR 38811)
Version Date: Dec 2, 2025
Principal Investigator(s):
Sally S. Simpson, University of Maryland;
Ritu Agarwal, Johns Hopkins University. Carey Business School;
Guodong Gordon Gao, Johns Hopkins University. Carey Business School
https://doi.org/10.3886/ICPSR38811.v1
Version V1
Summary
This project used big data on non-clinical physician behavior, including traffic violations, substance abuse, property ownership, stressors (e.g., bankruptcy and divorce), social media data, and other life events data. These variables, all based on public records, were used to construct a predictive model of Medicare fraud using machine learning techniques.
Citation
Funding
Subject Terms
Geographic Coverage
Smallest Geographic Unit
Zip Code
Restrictions
Access to these data is restricted. Users interested in obtaining these data must complete a Restricted Data Use Agreement, specify the reason for the request, and obtain IRB approval or notice of exemption for their research.
Distributor(s)
Time Period(s)
Date of Collection
Study Purpose
The study team aimed to provide new policy-relevant applications using data science to improve risk assessment of physician engagement in fraud. Specifically, they sought answers to the following questions:
Sample
The sample consisted of physicians excluded from Medicare participation from 2015 through 2019 due to fraudulent activity and a control group of matched non-fraudulent physicians. Matching was initially based on five characteristics: gender, primary practice zip code, primary taxonomy, whether the physician was the sole owner of the practice, and degree credentials. Age (within 5 years) was later added as a matching criterion. To obtain a complete control sample, the study team loosened the matching criteria somewhat, reclassifying some taxonomies (the National Provider Identifier Registry term for medical specialties) to be slightly broader (e.g., 'Internal Medicine' instead of 'Internal Medicine Cardiovascular Disease').
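The matching procedure described above can be illustrated with a minimal Python sketch. It is not the study team's code. The `Physician` fields, the `BROADER_TAXONOMY` mapping, the one-to-one ratio, and matching without replacement are all assumptions made for illustration; the study documentation does not specify them here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class Physician:
    # Hypothetical record layout; field names are illustrative only.
    pid: str
    gender: str
    zip_code: str
    taxonomy: str
    sole_owner: bool
    credential: str
    age: int


# Assumed mapping from a narrow taxonomy to a broader one, following the
# single example given in the text. A real mapping would be much larger.
BROADER_TAXONOMY = {
    "Internal Medicine Cardiovascular Disease": "Internal Medicine",
}


def broaden(taxonomy: str) -> str:
    """Return the broader taxonomy if one is defined, else the input."""
    return BROADER_TAXONOMY.get(taxonomy, taxonomy)


def is_match(case: Physician, cand: Physician,
             relaxed: bool = False, age_window: int = 5) -> bool:
    """Compare the five characteristics plus age within +/- age_window years."""
    case_tax = broaden(case.taxonomy) if relaxed else case.taxonomy
    cand_tax = broaden(cand.taxonomy) if relaxed else cand.taxonomy
    return (
        case.gender == cand.gender
        and case.zip_code == cand.zip_code
        and case_tax == cand_tax
        and case.sole_owner == cand.sole_owner
        and case.credential == cand.credential
        and abs(case.age - cand.age) <= age_window
    )


def find_control(case: Physician, pool: Iterable[Physician],
                 used: Set[str]) -> Optional[Physician]:
    """Try strict taxonomy matching first, then the broadened taxonomy.

    Matching without replacement (tracking `used`) is an assumption.
    """
    candidates = list(pool)
    for relaxed in (False, True):
        for cand in candidates:
            if cand.pid not in used and is_match(case, cand, relaxed):
                used.add(cand.pid)
                return cand
    return None
```

In this sketch, a case with taxonomy 'Internal Medicine Cardiovascular Disease' that finds no strict match falls back to candidates whose taxonomy is 'Internal Medicine', provided the other characteristics and the age window still hold.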
Time Method
Universe
Physicians in the United States who have been excluded from federally funded health programs and a matched sample of non-excluded physicians who share a similar location, age, gender, and credentials.
Unit(s) of Observation
Data Type(s)
Mode of Data Collection
Description of Variables
Both static and dynamic variables were acquired for the two groups of physicians in this study. Dynamic variables were collected from 2000 to 2019 and include information about physicians' online reviews, political donations, transaction records, criminal records, services to Medicare beneficiaries, and background checks. Static variables were collected in 2020 and include physician demographic information, practices, geolocations, and background checks.
Original Release Date
2025-12-02
Version History
2025-12-02 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:
- Checked for undocumented or out-of-range codes.
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats for researchers to be able to access data and documentation in formats that work well within their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.
One or more files in this data collection have special restrictions. Restricted data files are not available for direct download from the website; click on the Restricted Data button to learn more.
