HeyJay!: A Corpus of Atypical Speech for Spoken Language Understanding and Automatic Speech Recognition, United States, 2023-2024 (ICPSR 39448)

Version Date: Aug 19, 2025

Principal Investigator(s):
Laureano Moro-Velazquez, Johns Hopkins University

https://doi.org/10.3886/ICPSR39448.v1

Version V1

  • V3 [2026-03-18]
  • V1 [2025-08-19] unpublished

You are viewing an older version of this study. A newer version is available.

Additional details may be in the Version History or Data Collection Notes fields of the study metadata.

2025-08-25 Updated processing note in ICPSR Codebooks and metadata.

2025-08-19 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Performed consistency checks.
  • Checked for undocumented or out-of-range codes.


HeyJay! is a restricted-access study consisting of speech audio files and associated metadata, including file-level annotations and participant-level information. HeyJay! is a new corpus of atypical speech from participants with neurodegenerative disorders, including Parkinson's Disease, Ataxias, or Amyotrophic Lateral Sclerosis.

The current corpus version contains more than 8,500 utterance recordings with supervised transcriptions and intent annotations. Additionally, it includes speech quality ratings for each participant, performed by three expert speech and language pathologists. This corpus, the first publicly available corpus of atypical speech with intent annotation, is intended to make speech technologies fairer for atypical speakers by adapting and improving the state of the art, and to enable further research in the field.

Moro-Velazquez, Laureano. HeyJay!: A Corpus of Atypical Speech for Spoken Language Understanding and Automatic Speech Recognition, United States, 2023-2024. Inter-university Consortium for Political and Social Research [distributor], 2025-08-19. https://doi.org/10.3886/ICPSR39448.v1

Johns Hopkins University + Amazon Initiative for Interactive Artificial Intelligence (AI2AI)

This data collection may not be used for any purpose other than statistical reporting and analysis. Use of these data to learn the identity of any person or establishment is prohibited. To protect respondent privacy, the data files in this collection are restricted from general dissemination. To obtain these restricted files, researchers must agree to the terms and conditions of a Restricted Data Use Agreement.

Inter-university Consortium for Political and Social Research

2023-02-01 -- 2024-05-31
  1. Some speakers recorded sentences that overlapped with transcription and intent of the SLURP dataset. For additional information, please visit the SLURP repository.
  2. For more information on the Fluent Speech Commands dataset, please visit the Fluent.ai website.

The purpose of the study was to use part of HeyJay! to create synthetic speech on new intent domains (to replicate SLURP), train SLU models, and then evaluate the resulting models with speech from actual atypical speakers.

Participants were asked to record a minimum of 320 complex commands, although not all participants finished the assigned tasks. Responses were recorded using Hermespeech Recorder, an open-source web platform.

The recordings took place in two different scenarios:

  • In the clinic, supervised by a research coordinator, using the same computer equipped with a Focusrite Scarlett external sound card and an AKG headset condenser microphone.
  • At home, using the participant's own computer and microphone. In these cases, participants received a video with clear recording instructions, followed by a short training session over videoconference with a research assistant who ensured that they understood the recording requirements. All participants first recorded three test utterances, which the project researchers reviewed to confirm that the recording conditions were optimal and that no significant background noise or channel distortions were present. When researchers observed recording issues, participants were guided on how to resolve them before starting the actual recordings.

Participants were recorded in one scenario of their choice, or in both. All recordings had a single channel and a sampling rate of 22 kHz. Participants were presented with various sentences in English to be read as if they were talking to a speech assistant named Jay; for that reason, all sentences started with the wake word "Hey, Jay," which gives the corpus its name. For 18 participants, the sentences overlapped in intent and transcription with those from the Fluent Speech Commands (FSC) dataset. The remaining speakers recorded sentences that overlapped in transcription and intent with the SLURP (Spoken Language Understanding Resource Package) dataset. These two references were selected because they represent a basic (FSC) and a challenging (SLURP) benchmark for SLU systems. This will also allow researchers to combine HeyJay! with these two benchmarks to train larger models in future experiments.

In the portion of HeyJay! compatible with the FSC dataset (HeyJay-FSC), the intent of an utterance was labeled using three slots:

  • Action
  • Object
  • Location

For instance, if the utterance is "Hey Jay, turn off the living room's lights", the action is "turn off", the object "lights", and the location "living room". As in the original FSC dataset, HeyJay-FSC contains 248 unique phrases and 31 different intents.
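The three-slot scheme above can be sketched as a simple data record. This is only an illustration of the annotation structure; the class and field names are hypothetical and do not necessarily match the corpus's actual file schema.

```python
# Hypothetical sketch of an FSC-style intent frame; names are illustrative,
# not HeyJay!'s actual file schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class FSCIntent:
    action: str    # e.g., "turn off"
    object: str    # e.g., "lights"
    location: str  # e.g., "living room"

    def label(self) -> str:
        # Collapse the three slots into one intent label string.
        return "|".join([self.action, self.object, self.location])

intent = FSCIntent(action="turn off", object="lights", location="living room")
print(intent.label())  # turn off|lights|living room
```

Because every FSC-style intent is just a fixed triple of slot values, the 31 distinct intents correspond to the distinct (action, object, location) combinations that occur in the 248 phrases.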

On the other hand, the sentence annotation from SLURP (HeyJay-SLURP) follows a more complex hierarchical structure:

  1. Scenario: Represents the general domain or context of the utterance (e.g., "calendar", "weather").
  2. Intent: Captures the user's specific objective within that domain (e.g., "calendar/set_event", "weather/temperature").
  3. Slots and Actions: Semantic slot annotations that identify key elements in the utterance such as times, places, names, or items that are essential to fulfilling the task.

For instance, in the utterance "Set a reminder for 3 PM tomorrow", the system would tag it under the "reminder" scenario, assign it the intent "reminder/set", and recognize "3 PM tomorrow" as a time-related slot. The SLURP dataset offers a collection of intent annotations tailored for goal-directed spoken interactions across common real-world domains, including topics like weather, music, travel, and cooking. This comprehensive annotation approach makes SLURP particularly valuable for developing and benchmarking SLU models that need to interpret both user intent and detailed semantic information.
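The hierarchical scenario/intent/slot structure described above can be sketched as a nested record. The keys and slot representation here are hypothetical illustrations of the SLURP-style hierarchy, not HeyJay!'s actual file schema.

```python
# Hypothetical sketch of a SLURP-style hierarchical annotation;
# keys are illustrative, not HeyJay!'s actual file schema.
annotation = {
    "scenario": "reminder",       # general domain of the utterance
    "intent": "reminder/set",     # user's objective within that domain
    "slots": [
        # Semantic slots identify key elements needed to fulfill the task.
        {"type": "time", "filler": "3 PM tomorrow"},
    ],
}

def intent_scenario(ann):
    # In this scheme the scenario is recoverable as the intent's prefix.
    return ann["intent"].split("/", 1)[0]

print(intent_scenario(annotation))  # reminder
```

A consistency check like `intent_scenario` illustrates why the hierarchy is convenient: the scenario level can be derived from (or validated against) the intent label.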

Cross-sectional

Adults with confirmed neuromotor disorders and perceived speech impairment. The speech impairments ranged from mild to severe, and the diagnoses include Parkinson's Disease, Ataxia (Episodic, Spinocerebellar, Cerebellar), Multiple System Atrophy, Amyotrophic Lateral Sclerosis, Stroke, and Dystonia.

Individuals

The data includes variables about disease etiology, severity of the speech impairment, and annotation of the quality of voice of each participant. In particular, the rated atypical speech characteristics included overall dysarthria severity, overall articulatory severity, imprecise consonant articulation, prolonged phonemes, repeated phonemes, irregular articulatory breakdowns, presence of distorted vowels, overall voice quality, harsh voice, hoarse/wet voice, breathy voice, strained/strangled voice, voice stoppages, and voice flutter. Demographic variables include age and sex.

Each of the three expert speech and language pathologists received a set of eight audio samples per speaker and evaluated the overall severity of atypical speech characteristics. Ratings ranged from 0 to 4, with 0 representing typical speech and 4 reflecting a highly impaired condition, in accordance with the Rating Scale for Deviant Speech Characteristics. Not all pathologists were able to evaluate all participants or traits.
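Since not every pathologist rated every participant or trait, per-trait summaries have to tolerate missing values. The following is a minimal sketch of that kind of aggregation, assuming missing ratings are represented as `None`; it is not the study's actual scoring code.

```python
# Hypothetical sketch: averaging 0-4 severity ratings across raters while
# tolerating missing evaluations (None). Not the study's actual scoring code.
from statistics import mean

def mean_rating(ratings):
    # Keep only the ratings that were actually provided.
    available = [r for r in ratings if r is not None]
    return mean(available) if available else None

# Three raters; the second could not evaluate this trait for this speaker.
print(mean_rating([2, None, 3]))  # 2.5
```

Returning `None` when no rater evaluated a trait keeps missing data distinct from a genuine rating of 0 (typical speech) on the 0-4 scale.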


2025-08-19
