The TORGO Database: Acoustic and articulatory speech from speakers with dysarthria

The TORGO database of dysarthric articulation consists of aligned acoustics and measured 3D articulatory features from speakers with either cerebral palsy (CP) or amyotrophic lateral sclerosis (ALS), two of the most prevalent causes of speech disability (Kent and Rosen, 2004), and from matched controls. TORGO is the result of a collaboration between the departments of Computer Science and Speech-Language Pathology at the University of Toronto and the Holland-Bloorview Kids Rehab hospital in Toronto.

Speakers. Both CP and ALS result in dysarthria, which is caused by disruptions in the neuro-motor interface. These disruptions distort motor commands to the vocal articulators, resulting in atypical and relatively unintelligible speech in most cases (Kent, 2000). This unintelligibility can significantly diminish the usefulness of traditional automatic speech recognition (ASR) software. The inability of modern ASR to understand dysarthric speech effectively is a major problem, since the more general physical disabilities often associated with the condition can make other forms of computer input, such as keyboards or touch screens, especially difficult (Hosom et al., 2003).

Purpose. The TORGO database was originally conceived primarily as a resource for developing advanced ASR models better suited to the needs of people with dysarthria, although it is also applicable to non-dysarthric speech. A primary reason for collecting detailed physiological information is to be able to learn 'hidden' articulatory parameters automatically via statistical pattern recognition. For example, research has shown that modelling conditional relationships between articulation and acoustics in Bayesian networks can reduce error by about 28% relative to acoustic-only models for non-dysarthric speakers (Markov et al., 2006; Rudzicz, 2009).
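To make the idea of conditioning acoustics on articulation concrete, the sketch below fits the simplest possible conditional model, a linear-Gaussian regression from articulatory channels to acoustic features. This is only an illustration on synthetic data with assumed array shapes; it is not the Bayesian-network models of Markov et al. (2006) or Rudzicz (2009).

```python
# Minimal sketch: maximum-likelihood fit of a linear-Gaussian conditional
# p(acoustics | articulation). All shapes and data here are synthetic
# placeholders, not TORGO conventions.
import numpy as np

rng = np.random.default_rng(0)
artic = rng.normal(size=(1000, 6))      # e.g. 6 articulatory channels per frame
true_map = rng.normal(size=(6, 13))
acoust = artic @ true_map + 0.1 * rng.normal(size=(1000, 13))  # e.g. 13 MFCCs

# Least squares for weights W and bias b in acoust ~ artic @ W + b, which is
# the maximum-likelihood estimate under Gaussian residual noise.
X = np.hstack([artic, np.ones((len(artic), 1))])
coef, *_ = np.linalg.lstsq(X, acoust, rcond=None)
W, b = coef[:-1], coef[-1]

residual = acoust - (artic @ W + b)
print("mean residual std:", residual.std(axis=0).mean())
```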

Content. This database contains the majority of the data recorded as part of this project; certain subsets of the data have, however, not been included.

Acknowledgements. All data were recorded between 2008 and 2010 in Toronto, Canada. This work was funded by Bell University Labs, the Natural Sciences and Engineering Research Council of Canada (NSERC), and the University of Toronto. Equipment and space were funded by grants from the Canada Foundation for Innovation, the Ontario Innovation Trust, and the Ministry of Research and Innovation.

In the associated paper, we provide additional statistics on the relations between disordered and control speech. For more information on the collection of this database, please consult our relevant publications, below, or contact Frank Rudzicz.

Additions or corrections are welcome.


Instrumentation and stimuli



[Figures: a subject seated in the AG500; sensor coil positions]

The collection of movement data and time-aligned acoustic data is carried out using the 3D AG500 electromagnetic articulograph (EMA) system (Carstens Medizinelektronik GmbH, Lenglern, Germany) with fully automated calibration. This system allows for 3D recordings of articulatory movements inside and outside the vocal tract, providing a detailed window on the nature and direction of speech-related activity.

Here, six transmitters attached to a clear cube-shaped acrylic structure (L 58.4 x W 53.3 x H 49.5 cm) generate alternating electromagnetic fields. Each transmitter coil has a characteristic oscillating frequency ranging from 7.5 to 13.75 kHz (Yunusova et al., 2009). As recommended by the manufacturer, the AG500 system is calibrated prior to each session, after a minimum 3-hour warm-up. Positional errors are reported to be significantly smaller at or close to the cube's centre than in the peripheral regions of the recording field within the cube (Yunusova et al., 2009). Subject positioning within the cube is aided visually by the 'Cs5view' real-time position display program (Carstens Medizinelektronik GmbH, Lenglern, Germany), which allows the experimenter to monitor the subject's position continuously and thereby maintain low mean-squared-error values.
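Since positional error grows toward the periphery of the recording field, one simple offline counterpart to this monitoring is to flag frames where a coil has drifted far from the cube's centre. The sketch below does exactly that; the coordinate convention (origin at the cube centre) and the 10 cm threshold are assumptions for illustration, not AG500 specifications.

```python
# Hedged sketch: flag EMA frames recorded far from the cube's centre, where
# positional error is reported to be lowest. Threshold and coordinate origin
# are illustrative assumptions.
import numpy as np

def flag_off_centre(positions_cm: np.ndarray, threshold_cm: float = 10.0) -> np.ndarray:
    """positions_cm: (n_frames, 3) coil positions relative to the cube centre."""
    return np.linalg.norm(positions_cm, axis=1) > threshold_cm

demo = np.random.default_rng(1).normal(scale=6.0, size=(500, 3))
print("frames outside the low-error region:", int(flag_off_centre(demo).sum()))
```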

Sensor coils are attached to three points on the surface of the tongue: the tongue tip (TT, 1 cm behind the anatomical tongue tip), the tongue middle (TM, 3 cm behind the tongue-tip coil), and the tongue back (TB, approximately 2 cm behind the tongue-middle coil). A sensor for tracking jaw movements (JA) is attached to a custom mould made from polymer thermoplastic that fits the surface of the lower incisors, which is necessary for more accurate and reproducible recording. Four additional coils are placed on the upper and lower lips (UL and LL) and the left and right corners of the mouth (LM and RM). Further coils are placed on the subject's forehead, nose bridge, and behind each ear above the mastoid bone to serve as references and to record head motion. Except for the left and right mouth corners, all sensors that measure the vocal tract lie generally on the midsagittal plane, on which most of the relevant motion of speech takes place. Sensors are attached by thin, lightweight cables to the recording equipment but do not impede free motion of the head within the EMA cube. Many individuals with cerebral palsy use metal wheelchairs for transportation, which would interfere with the electromagnetic field; for the purposes of recording, these individuals were easily moved to a wooden chair.
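Because the forehead, nose-bridge, and mastoid coils track head motion, articulator trajectories can be re-expressed in a head-centred frame by rigidly aligning each frame's reference coils to a fixed template. The sketch below uses the standard Procrustes/Kabsch solution; this is a common EMA post-processing approach, not necessarily the exact pipeline used for TORGO.

```python
# Hedged sketch: head-motion normalisation of EMA trajectories via rigid
# (Kabsch) alignment of the reference coils to a template posture.
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t such that dst ~ src @ R.T + t."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, dst_c - R @ src_c

def to_head_frame(ref_now, ref_template, articulators):
    """Map one frame's articulator positions (n_coils, 3) into the head frame."""
    R, t = rigid_transform(ref_now, ref_template)
    return articulators @ R.T + t
```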

All acoustic data are recorded simultaneously through two microphones. The first is an Acoustic Magic Voice Tracker array microphone with eight recording elements arranged horizontally along a span of 45.7 cm. The device uses amplitude information at each element to locate the speaker within its 60-degree range and to reduce acoustic noise through spatial and amplitude filtering in firmware. This microphone records audio at 44.1 kHz and is placed facing the participant at a distance of 61 cm. The second is a head-mounted electret microphone that records audio at 22.1 kHz.
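To use the two channels together, they must typically be brought to a common sample rate and time-aligned. Below is a minimal sketch, assuming hypothetical file names 'array.wav' and 'headset.wav' and a target rate of 16 kHz (a common ASR rate, not a TORGO convention); it resamples both signals and estimates their relative offset by cross-correlation.

```python
# Hedged sketch: resample the two simultaneous recordings to one rate and
# estimate their relative lag. File names and target rate are assumptions.
from fractions import Fraction
import numpy as np
from scipy.io import wavfile
from scipy.signal import correlate, resample_poly

TARGET_SR = 16_000

def load_resampled(path: str) -> np.ndarray:
    sr, audio = wavfile.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)               # mix multi-channel audio to mono
    ratio = Fraction(TARGET_SR, sr)              # e.g. 16000/44100 -> 160/441
    return resample_poly(audio.astype(np.float64), ratio.numerator, ratio.denominator)

array_mic = load_resampled("array.wav")          # hypothetical paths
headset = load_resampled("headset.wav")

# The correlation peak gives the lag of array_mic relative to headset.
lag = int(np.argmax(correlate(array_mic, headset, mode="full"))) - (len(headset) - 1)
print(f"estimated offset: {lag / TARGET_SR:.3f} s")
```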


Prompts



All subjects read English text from a 19-inch LCD screen. One subject experienced visual fatigue near the end of one session and therefore repeated a small set of verbal stimuli spoken aloud by an experimenter; no discernible effect of this approach was measured. The stimuli were presented to the participants in randomized order within fixed-size collections in order to avoid priming or dependency effects. Dividing the stimuli into collections in this manner guaranteed overlap between subjects who speak at vastly different rates, as the sketch below illustrates.
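The sketch shuffles prompts within fixed-size collections so that two subjects who complete different numbers of collections still overlap on the same early material; the collection size of 20 and the placeholder prompt names are assumptions for illustration, not the sizes used in TORGO.

```python
# Hedged sketch: randomise stimuli within fixed-size collections (blocks),
# never across them, so slow and fast speakers cover the same collections.
import random

def blocked_order(stimuli, block_size=20, seed=0):
    rng = random.Random(seed)
    blocks = [stimuli[i:i + block_size] for i in range(0, len(stimuli), block_size)]
    for block in blocks:
        rng.shuffle(block)                       # shuffle each collection in place
    return [item for block in blocks for item in block]

prompts = [f"prompt_{i:03d}" for i in range(100)]   # placeholder stimuli
session_order = blocked_order(prompts)
```

Stimuli are classified into the following categories: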

Non-words

These are used to control for the baseline abilities of the dysarthric speakers, especially to gauge their articulatory control in the presence of plosives and prosody. Speakers are asked to perform the following:

Short words

These are useful for studying speech acoustics without the need for word boundary detection. This category includes the following:

Restricted sentences

In order to utilize lexical, syntactic, and semantic processing in ASR, full and syntactically correct sentences are recorded. These include the following: