HeyJay!: A Corpus of Atypical Speech for Spoken Language Understanding and Automatic Speech Recognition, United States, 2023-2024 (ICPSR 39448)

Version Date: Aug 19, 2025

Principal Investigator(s):
Laureano Moro-Velazquez, Johns Hopkins University

https://doi.org/10.3886/ICPSR39448.v1

Version V1

  • V3 [2026-03-18]
  • V1 [2025-08-19] unpublished

You are viewing an older version of this study. A newer version is available.

Additional details may be in the Version History or Data Collection Notes fields of the study metadata.

2025-08-25 Updated processing note in ICPSR Codebooks and metadata.

2025-08-19 ICPSR data undergo a confidentiality review and are altered when necessary to limit the risk of disclosure. ICPSR also routinely creates ready-to-go data files along with setups in the major statistical software formats as well as standard codebooks to accompany the data. In addition to these procedures, ICPSR performed the following processing steps for this data collection:

  • Performed consistency checks.
  • Checked for undocumented or out-of-range codes.


HeyJay! is a restricted-access study consisting of speech audio files and associated metadata, including file-level annotations and participant-level information. HeyJay! is a new corpus of atypical speech from participants with neurodegenerative disorders, including Parkinson's Disease, Ataxias, or Amyotrophic Lateral Sclerosis.

The current corpus version contains more than 8,500 utterance recordings with supervised transcriptions and intent annotations. Additionally, it includes speech quality ratings for each participant, performed by three expert speech and language pathologists. This corpus, the first publicly available corpus of atypical speech with intent annotation, is intended to make speech technologies fairer for atypical speakers by adapting and improving the state of the art, and to enable further research in the field.

Moro-Velazquez, Laureano. HeyJay!: A Corpus of Atypical Speech for Spoken Language Understanding and Automatic Speech Recognition, United States, 2023-2024. Inter-university Consortium for Political and Social Research [distributor], 2025-08-19. https://doi.org/10.3886/ICPSR39448.v1

Johns Hopkins University + Amazon Initiative for Interactive Artificial Intelligence (AI2AI)

This data collection may not be used for any purpose other than statistical reporting and analysis. Use of these data to learn the identity of any person or establishment is prohibited. To protect respondent privacy, the data files in this collection are restricted from general dissemination. To obtain these restricted files, researchers must agree to the terms and conditions of a Restricted Data Use Agreement.

Inter-university Consortium for Political and Social Research

2023-02-01 -- 2024-05-31
  1. Some speakers recorded sentences that overlapped with transcription and intent of the SLURP dataset. For additional information, please visit the SLURP repository.
  2. For more information on the Fluent Speech Commands dataset, please visit the Fluent.ai website.

The purpose of the study was to use part of HeyJay! to create synthetic speech on new intent domains (to replicate SLURP), train SLU models, and then evaluate the resulting models with speech from actual atypical speakers.

Participants were asked to record a minimum of 320 complex commands, although not all participants finished the assigned tasks. Responses were recorded using Hermespeech Recorder, an open-source web platform.

The recordings took place in two different scenarios:

  • In the clinic, supervised by a research coordinator, using the same computer equipped with a Focusrite Scarlett external sound card and an AKG headset condenser microphone.
  • At home, using the participant's own computer and microphone. In these cases, participants received a video with clear recording instructions, followed by a short training session over videoconference with a research assistant who ensured that they understood the recording requirements. All participants first recorded three test utterances, which the project researchers reviewed to confirm that the recording conditions were optimal and that no significant background noise or channel distortions were present. When researchers observed recording issues, participants were guided on how to resolve them before starting the actual recordings.

Participants were recorded in one scenario of their choice, or in both. All recordings had a single channel and a sampling rate of 22 kHz. Participants were presented with various sentences in English to be read as if they were talking to a speech assistant named Jay; for that reason, all sentences started with the wake word "Hey, Jay," which gives the corpus its name. For 18 participants, the sentences overlapped in intent and transcription with those from the Fluent Speech Commands (FSC) dataset. The remaining speakers recorded sentences that overlapped in transcription and intent with the SLURP (Spoken Language Understanding Resource Package) dataset. These two references were selected because they represent a basic (FSC) and a challenging (SLURP) benchmark for SLU systems. This will also allow researchers to combine HeyJay! with these two benchmarks to train larger models in future experiments.

In the portion of HeyJay! compatible with the FSC dataset (HeyJay-FSC), the intent of an utterance was labeled using three slots:

  • Action
  • Object
  • Location

For instance, if the utterance is "Hey Jay, turn off the living room's lights", the action is "turn off", the object "lights", and the location "living room". As in the original FSC dataset, HeyJay-FSC contains 248 unique phrases and 31 different intents.
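The three-slot scheme above can be sketched as a simple data record. This is only an illustration of the annotation structure; the class and field names are hypothetical and do not necessarily match the corpus's actual file schema.

```python
# Hypothetical sketch of an FSC-style intent frame; names are illustrative,
# not HeyJay!'s actual file schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class FSCIntent:
    action: str    # e.g., "turn off"
    object: str    # e.g., "lights"
    location: str  # e.g., "living room"

    def label(self) -> str:
        # Collapse the three slots into one intent label string.
        return "|".join([self.action, self.object, self.location])

intent = FSCIntent(action="turn off", object="lights", location="living room")
print(intent.label())  # turn off|lights|living room
```

Because every FSC-style intent is just a fixed triple of slot values, the 31 distinct intents correspond to the distinct (action, object, location) combinations that occur in the 248 phrases.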

On the other hand, the sentence annotation from SLURP (HeyJay-SLURP) follows a more complex hierarchical structure:

  1. Scenario: Represents the general domain or context of the utterance (e.g., "calendar", "weather").
  2. Intent: Captures the user's specific objective within that domain (e.g., "calendar/set_event", "weather/temperature").
  3. Slots and Actions: Semantic slot annotations that identify key elements in the utterance such as times, places, names, or items that are essential to fulfilling the task.

For instance, in the utterance "Set a reminder for 3 PM tomorrow", the system would tag it under the "reminder" scenario, assign it the intent "reminder/set", and recognize "3 PM tomorrow" as a time-related slot. The SLURP dataset offers a collection of intent annotations tailored for goal-directed spoken interactions across common real-world domains, including topics like weather, music, travel, and cooking. This comprehensive annotation approach makes SLURP particularly valuable for developing and benchmarking SLU models that need to interpret both user intent and detailed semantic information.
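The hierarchical scenario/intent/slot structure described above can be sketched as a nested record. The keys and slot representation here are hypothetical illustrations of the SLURP-style hierarchy, not HeyJay!'s actual file schema.

```python
# Hypothetical sketch of a SLURP-style hierarchical annotation;
# keys are illustrative, not HeyJay!'s actual file schema.
annotation = {
    "scenario": "reminder",       # general domain of the utterance
    "intent": "reminder/set",     # user's objective within that domain
    "slots": [
        # Semantic slots identify key elements needed to fulfill the task.
        {"type": "time", "filler": "3 PM tomorrow"},
    ],
}

def intent_scenario(ann):
    # In this scheme the scenario is recoverable as the intent's prefix.
    return ann["intent"].split("/", 1)[0]

print(intent_scenario(annotation))  # reminder
```

A consistency check like `intent_scenario` illustrates why the hierarchy is convenient: the scenario level can be derived from (or validated against) the intent label.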

Cross-sectional

Adults with confirmed neuromotor disorders and perceived speech impairment. The speech impairments ranged from mild to severe, and the diagnoses include Parkinson's Disease, Ataxia (Episodic, Spinocerebellar, Cerebellar), Multiple System Atrophy, Amyotrophic Lateral Sclerosis, Stroke, and Dystonia.

Individuals

The data includes variables about disease etiology, severity of the speech impairment, and annotation of the quality of voice of each participant. In particular, the rated atypical speech characteristics included overall dysarthria severity, overall articulatory severity, imprecise consonant articulation, prolonged phonemes, repeated phonemes, irregular articulatory breakdowns, presence of distorted vowels, overall voice quality, harsh voice, hoarse/wet voice, breathy voice, strained/strangled voice, voice stoppages, and voice flutter. Demographic variables include age and sex.

Each of the three expert speech and language pathologists received a set of eight audio samples per speaker and evaluated the overall severity of atypical speech characteristics. Ratings ranged from 0 to 4, with 0 representing typical speech and 4 reflecting a highly impaired condition, in accordance with the Rating Scale for Deviant Speech Characteristics. Not all pathologists were able to evaluate all participants or traits.
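Since not every pathologist rated every participant or trait, per-trait summaries have to tolerate missing values. The following is a minimal sketch of that kind of aggregation, assuming missing ratings are represented as `None`; it is not the study's actual scoring code.

```python
# Hypothetical sketch: averaging 0-4 severity ratings across raters while
# tolerating missing evaluations (None). Not the study's actual scoring code.
from statistics import mean

def mean_rating(ratings):
    # Keep only the ratings that were actually provided.
    available = [r for r in ratings if r is not None]
    return mean(available) if available else None

# Three raters; the second could not evaluate this trait for this speaker.
print(mean_rating([2, None, 3]))  # 2.5
```

Returning `None` when no rater evaluated a trait keeps missing data distinct from a genuine rating of 0 (typical speech) on the 0-4 scale.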


2025-08-19
