Using Physician Behavioral Big Data for High Precision Fraud Prediction and Detection, United States, 2000-2019 (ICPSR 38811)

Version Date: Dec 2, 2025

Principal Investigator(s):
Sally S. Simpson, University of Maryland; Ritu Agarwal, Johns Hopkins University. Carey Business School; Guodong Gordon Gao, Johns Hopkins University. Carey Business School

https://doi.org/10.3886/ICPSR38811.v1

Version V1

This project used big data on non-clinical physician behavior, including traffic violations, substance abuse, property ownership, stressors (e.g., bankruptcy and divorce), social media data, and other life events data. These variables, all based on public records, were used to construct a predictive model of Medicare fraud using machine learning techniques.

Simpson, Sally S., Agarwal, Ritu, and Gao, Guodong Gordon. Using Physician Behavioral Big Data for High Precision Fraud Prediction and Detection, United States, 2000-2019. Inter-university Consortium for Political and Social Research [distributor], 2025-12-02. https://doi.org/10.3886/ICPSR38811.v1

United States Department of Justice. Office of Justice Programs. National Institute of Justice (2019-R2-CX-0016)

Zip Code

Access to these data is restricted. Users interested in obtaining these data must complete a Restricted Data Use Agreement, specify the reason for the request, and obtain IRB approval or notice of exemption for their research.

Inter-university Consortium for Political and Social Research

2000-01-01 -- 2020-12-31
2019-01-01 -- 2022-12-31

The study team aimed to provide new policy-relevant applications using data science to improve risk assessment of physician engagement in fraud. Specifically, they sought answers to the following questions:

  • Can models using big data on non-clinical physician behavior (e.g., illegal behavior, consumer complaints and malpractice, other disciplinary action, conspicuous spending, and life stressors) successfully predict engagement in fraud in the near-term future (1-5 years)?
  • Of these behavioral factors, which ones represent the greatest risk for fraud engagement?
  • Which machine learning algorithm is most accurate in predicting a physician's risk of engaging in fraud?

The sample consisted of physicians excluded from Medicare participation between 2015 and 2019 due to fraudulent activity and a control group of matched non-fraudulent physicians. Matching was initially based on five characteristics (i.e., gender, primary practice zip code, primary taxonomy, whether the physician was the sole owner of the practice, and degree credentials). Age (within 5 years in either direction) was later added as a matching criterion. To obtain a complete control sample, the study team loosened the matching criteria somewhat, reclassifying some taxonomies (the National Provider Identifier Registry term for medical specialties) to be slightly broader (e.g., 'Internal Medicine' instead of 'Internal Medicine Cardiovascular Disease').
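
The matching procedure described above can be illustrated with a minimal sketch. It is not the study team's actual code: the `Physician` fields, the `BROADER` taxonomy map, the function names, and the choice to take the first eligible candidate are all assumptions made for illustration. It tries an exact match on the five characteristics plus the age window first, then retries with the broadened taxonomy.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Physician:
    # Hypothetical record layout; field names are illustrative only.
    npi: str
    gender: str
    zip_code: str
    taxonomy: str
    sole_owner: bool
    credential: str
    age: int


# Assumed mapping from a narrow taxonomy to a broader one,
# based on the single example given in the study description.
BROADER = {
    "Internal Medicine Cardiovascular Disease": "Internal Medicine",
}


def match_key(p: Physician, broad: bool = False) -> tuple:
    """Return the five matching characteristics as a comparable tuple."""
    taxonomy = BROADER.get(p.taxonomy, p.taxonomy) if broad else p.taxonomy
    return (p.gender, p.zip_code, taxonomy, p.sole_owner, p.credential)


def find_control(
    case: Physician, pool: Sequence[Physician], max_age_gap: int = 5
) -> Optional[Physician]:
    """Return the first eligible control, trying strict taxonomy first."""
    for broad in (False, True):
        for cand in pool:
            same_key = match_key(cand, broad) == match_key(case, broad)
            if same_key and abs(cand.age - case.age) <= max_age_gap:
                return cand
    return None  # no control found even after loosening the taxonomy
```

In this sketch a candidate that shares the narrow taxonomy but falls outside the age window is skipped, and a candidate listed under the broader taxonomy is accepted only on the second pass.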

    Longitudinal: Panel

    Physicians in the United States who have been excluded from federally funded health programs and a matched sample of non-excluded physicians who share a similar location, age, gender, and credentials.

    Individual

    Both static and dynamic variables were acquired on the two groups of physicians in this study. Dynamic variables were collected from 2000 to 2019 and include information about physicians' online reviews, political donations, transaction records, criminal records, services to Medicare beneficiaries, and background checks. Static variables were collected in 2020 and include physician demographic information, practices, geolocations, and background checks.

    2025-12-02

    2025-12-02 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

    • Checked for undocumented or out-of-range codes.

    Notes

    • The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.

    • ICPSR usually offers files in multiple formats for researchers to be able to access data and documentation in formats that work well within their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.

    • One or more files in this data collection have special restrictions. Restricted data files are not available for direct download from the website; click on the Restricted Data button to learn more.