Using Topic Segmentation to Enhance Concept Parsing and Identification of Negations [Methods Study], Massachusetts, 2019-2023 (ICPSR 39740)

Name: Using Topic Segmentation to Enhance Concept Parsing and Identification of Negations [Methods Study], Massachusetts, 2019-2023
Published: 2026-03-23
License: https://www.icpsr.umich.edu/web/ICPSR/studies/39740/terms

Version Date: Mar 23, 2026

Principal Investigator(s):
Alexander Turchin, Brigham and Women's Hospital

https://doi.org/10.3886/ICPSR39740.v1

Version V1

Summary

Clinical notes in electronic health records, or EHRs, may contain information that can help researchers study and compare treatments. But it takes researchers a lot of time to find information in EHR notes.

Natural language processing, or NLP, methods can help researchers find information in EHR notes. With NLP, computer programs read and identify written language to make it easier to sort and study. But in EHR notes, some sentences may contain more than one topic. Also, EHR notes may discuss a single topic over many sentences. In these cases, current NLP methods don't work well to find complete and accurate information about a specific topic.

In this study, the research team developed and tested new NLP methods to identify topics from EHR notes.

Citation

Turchin, Alexander. Using Topic Segmentation to Enhance Concept Parsing and Identification of Negations [Methods Study], Massachusetts, 2019-2023. Inter-university Consortium for Political and Social Research [distributor], 2026-03-23. https://doi.org/10.3886/ICPSR39740.v1

Funding

Patient-Centered Outcomes Research Institute (PCORI) (ME-2019C1-15328)

Subject Terms

artificial intelligence; cardiovascular disease; medical records; obesity; research models; tobacco use

Geographic Coverage

United States; Massachusetts

Distributor(s)

Inter-university Consortium for Political and Social Research

Time Period(s)

2019 -- 2023

Study Purpose

To develop and test new NLP methods for extracting complete and accurate information from segmented topics in clinical notes

Study Design

The research team developed two types of NLP approaches to detect topic segmentation. The first was the Transformer approach, based on an NLP technique called Bidirectional Encoder Representations from Transformers (BERT). The team pre-trained BERT on general English and then refined it using manually annotated EHR notes. The team built three types of Transformer methods: Base BERT, BioBERT, and Clinical BERT. The second NLP approach used software called Canary, to which the team added a new function to identify words that indicate a change of topic.
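The idea of using words that indicate a different topic can be illustrated with a minimal Python sketch. This is not Canary's implementation or its API: the function name `split_on_topic_cues` and the cue-word list are hypothetical, chosen only to show how splitting at such words keeps a negation ("denies") from spilling over onto a later, unrelated topic.

```python
import re

# Hypothetical cue words that may signal a shift to a new topic.
# Illustrative assumptions only, not the word list used in the study.
TOPIC_SHIFT_CUES = ("but", "however", "although")


def split_on_topic_cues(sentence, cues=TOPIC_SHIFT_CUES):
    """Split a sentence into segments at each topic-shift cue word."""
    # Match any cue as a whole word, ignoring case.
    pattern = r"\b(?:" + "|".join(re.escape(c) for c in cues) + r")\b"
    parts = re.split(pattern, sentence, flags=re.IGNORECASE)
    # Trim whitespace and stray punctuation, and drop empty pieces.
    cleaned = (p.strip(" ,;.") for p in parts)
    return [p for p in cleaned if p]


segments = split_on_topic_cues(
    "Patient denies tobacco use, but reports chest pain."
)
for segment in segments:
    print(segment)
```

In this sketch the negated statement about tobacco use and the statement about chest pain end up in separate segments, so a downstream negation detector could be applied to each segment on its own.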

Next, the research team tested how well each type of NLP approach identified segmented topics in EHR clinical notes for three clinical case examples: bariatric surgery discussion, statin therapy nonacceptance, and tobacco use history. To train the methods, the team used data sets for each case example that included at least 2,500 manually annotated EHR notes. Then, the team evaluated the methods against a separate validation data set of manually annotated notes. To assess performance, the team measured recall, precision, and F1 scores.
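For reference, recall, precision, and F1 can be computed from manual annotations and method output as in the sketch below. The function name `precision_recall_f1` and the example labels are hypothetical and are not data from the study; the labels are binary, with 1 meaning the topic was judged present in a note.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, F1) for paired binary labels."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)

    # Precision: share of predicted positives that are correct.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    # Recall: share of annotated positives that were found.
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Made-up labels for six notes (manual annotation vs. method output).
gold_labels = [1, 1, 1, 0, 0, 1]
predicted_labels = [1, 0, 1, 1, 0, 1]

precision, recall, f1 = precision_recall_f1(gold_labels, predicted_labels)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With these made-up labels there are three true positives, one false positive, and one false negative, so all three scores equal 0.75.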

Patients, clinicians, and data scientists provided input during the study.

Data Source

NLP training data sets with at least 2,500 clinical notes from outpatient visits for each of three clinical cases (bariatric surgery discussion, statin therapy nonacceptance, and tobacco use history) from the EHR database from Massachusetts General Hospital and Brigham and Women's Hospital

EHR external validation data set from Johns Hopkins Medicine

Original Release Date

2026-03-23

Notes

The public-use data files in this collection are available for access by the general public. Access does not require affiliation with an ICPSR member institution.
ICPSR usually offers files in multiple formats for researchers to be able to access data and documentation in formats that work well within their needs. If you have questions about the accessibility of materials distributed by ICPSR or require further assistance, please visit ICPSR’s Accessibility Center.
