Using Topic Segmentation to Enhance Concept Parsing and Identification of Negations [Methods Study], Massachusetts, 2019-2023 (ICPSR 39740)
Version Date: Mar 23, 2026
Principal Investigator(s):
Alexander Turchin, Brigham and Women's Hospital
https://doi.org/10.3886/ICPSR39740.v1
Version V1
Summary
Clinical notes in electronic health records, or EHRs, may contain information that can help researchers study and compare treatments. But it takes researchers a lot of time to find information in EHR notes.
Natural language processing, or NLP, methods can help researchers find information in EHR notes. With NLP, computer programs read and identify written language to make it easier to sort and study. But in EHR notes, some sentences may contain more than one topic. Also, EHR notes may discuss a single topic over many sentences. In these cases, current NLP methods don't work well to find complete and accurate information about a specific topic.
In this study, the research team developed and tested new NLP methods to identify topics from EHR notes.
Study Purpose
To develop and test new NLP methods for extracting complete and accurate information from segmented topics in clinical notes
Study Design
The research team developed two types of NLP approaches to detect topic segmentation. The first was the Transformer approach, based on an NLP technique called Bidirectional Encoder Representations from Transformers (BERT). The team pre-trained BERT on general English and then refined it using manually annotated EHR notes. The team built three types of Transformer methods: Base BERT, BioBERT, and Clinical BERT. The second NLP approach used software called Canary, to which the team added a new function to identify words that indicate a different topic.
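The idea of splitting a sentence at words that indicate a different topic can be illustrated with a minimal Python sketch. This is not Canary's implementation: the function name `split_on_topic_cues`, the cue-word list, and the example sentence are all hypothetical and chosen only for illustration.

```python
from typing import Iterable, List

# Hypothetical cue words that may signal a shift to a different topic.
# A real system would use a curated, validated list.
TOPIC_SHIFT_CUES = frozenset({"but", "however", "although"})


def split_on_topic_cues(sentence: str,
                        cues: Iterable[str] = TOPIC_SHIFT_CUES) -> List[str]:
    """Split a sentence into segments, starting a new segment at each cue word."""
    cue_set = {c.lower() for c in cues}
    segments: List[str] = []
    current: List[str] = []
    for token in sentence.split():
        # Compare the token without trailing or leading punctuation.
        word = token.strip(".,;:").lower()
        # Start a new segment only if the current one already has content.
        if word in cue_set and current:
            segments.append(" ".join(current))
            current = []
        current.append(token)
    if current:
        segments.append(" ".join(current))
    return segments


if __name__ == "__main__":
    note = "Patient denies tobacco use but is interested in bariatric surgery."
    for segment in split_on_topic_cues(note):
        print(segment)
```

In this example the negation "denies" stays in the tobacco segment and is not applied to the bariatric surgery segment, which is the kind of error that topic segmentation is meant to prevent.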
Next, the research team tested how well each type of NLP approach worked to identify segmented topics in EHR clinical notes for three clinical case examples: bariatric surgery discussion, statin therapy nonacceptance, and tobacco use history. To train the methods, the team used data sets for each case example that included at least 2,500 manually annotated EHR notes. Then, the team evaluated the methods against a separate validated data set of manually annotated notes. To assess performance, the team measured recall, precision, and F1 scores.
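For readers unfamiliar with these metrics, the following Python sketch shows how precision, recall, and F1 are computed from manually annotated labels and a method's output. The labels are made-up illustration data, not study results, and the function name `precision_recall_f1` is an assumption of this example.

```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int],
                        predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels, where 1 is positive."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)
    # Precision: share of predicted positives that are correct.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of actual positives that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels for eight notes (1 = note discusses the concept).
    gold = [1, 1, 1, 0, 0, 1, 0, 0]
    predicted = [1, 0, 1, 0, 1, 1, 0, 0]
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
    # prints: precision=0.75 recall=0.75 f1=0.75
```

Here the method finds three of the four positive notes (recall 0.75) and is correct on three of its four positive predictions (precision 0.75), so the F1 score is also 0.75.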
Patients, clinicians, and data scientists provided input during the study.
Data Source
NLP training data sets with at least 2,500 clinical notes from outpatient visits for each of three clinical cases (bariatric surgery discussion, statin therapy nonacceptance, and tobacco use history) from the EHR database from Massachusetts General Hospital and Brigham and Women's Hospital
EHR external validation data set from Johns Hopkins Medicine
Notes
The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats so that researchers can access data and documentation in the format that best suits their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.
