Forced Alignment using MAUS

We have now successfully set up a pipeline for forced segment alignment for our audio data in DoReCo using MAUS (Munich Automatic Segmentation), a service provided by our partner, the Bayerisches Archiv für Sprachsignale.

Forced alignment is a technique that automatically aligns segments (speech sounds, as given in transcription files) to acoustic events in a signal. Major sound classes such as vowels, glides, and fricatives each have a characteristic acoustic fingerprint that allows the MAUS software to identify them in a continuous flow of speech. MAUS’s forced alignment algorithm is trained on large sets of manually annotated data to learn which acoustic patterns correspond to individual sounds such as [p], [s], and [a].
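The general idea of forced alignment can be illustrated with a toy example. The sketch below is not MAUS's implementation: it assumes each frame of the signal is reduced to a single number and each segment to one fixed "template" value (both hypothetical simplifications, standing in for trained acoustic models), and it uses dynamic programming to find the cheapest way of assigning the frames to the known sequence of segments in order.

```python
import math
from typing import List


def force_align(frames: List[float], templates: List[float]) -> List[int]:
    """Assign each frame to one segment of a known segment sequence.

    frames    -- one (toy) acoustic value per time frame
    templates -- one (toy) expected value per segment, in transcription order
    Returns a list giving, for every frame, the index of its segment.
    The assignment is monotonic and every segment receives at least one frame.
    """
    n_frames, n_segs = len(frames), len(templates)
    if n_segs == 0 or n_frames < n_segs:
        raise ValueError("need at least one frame per segment")

    inf = math.inf
    cost = [[inf] * n_segs for _ in range(n_frames)]
    back = [[0] * n_segs for _ in range(n_frames)]

    # The first frame must belong to the first segment.
    cost[0][0] = abs(frames[0] - templates[0])

    for t in range(1, n_frames):
        for k in range(n_segs):
            stay = cost[t - 1][k]                       # remain in segment k
            advance = cost[t - 1][k - 1] if k > 0 else inf  # move on from k-1
            if advance < stay:
                best, back[t][k] = advance, k - 1
            else:
                best, back[t][k] = stay, k
            if best < inf:
                cost[t][k] = best + abs(frames[t] - templates[k])

    # Trace back from the last frame, which must belong to the last segment.
    path = [n_segs - 1]
    k = n_segs - 1
    for t in range(n_frames - 1, 0, -1):
        k = back[t][k]
        path.append(k)
    path.reverse()
    return path


# Toy input: seven frames, three segments with illustrative template values.
toy_frames = [0.1, 0.0, 0.9, 1.0, 1.1, 0.5, 0.4]
toy_templates = [0.0, 1.0, 0.5]
toy_path = force_align(toy_frames, toy_templates)
```

Segment boundaries fall wherever the segment index in the returned list changes. A real aligner differs in the features and models it uses, but the constraint is the same: the order of the sounds is given by the transcription, and only their timing is estimated.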

For DoReCo, we carry out two rounds of forced alignment. First, we send the audio and transcription data that we received from corpus donors to MAUS to obtain rough word alignments. We then manually correct those word boundaries by listening through the recordings and inspecting spectrograms. Finally, we use the corrected word boundaries as input for a second round of forced alignment, which gives us reliable segment alignments within words.
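The role of the corrected word boundaries in the second round can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `Interval` type, the function names, and in particular `uniform_placeholder_aligner`, which merely splits a word evenly and stands in for the actual call to MAUS. The point is only that segment boundaries are constrained to lie inside the manually corrected word intervals.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Interval:
    label: str
    start: float  # seconds
    end: float    # seconds


# Hypothetical aligner signature: word interval + its segment labels -> segments.
SegmentAligner = Callable[[Interval, List[str]], List[Interval]]


def uniform_placeholder_aligner(word: Interval, segments: List[str]) -> List[Interval]:
    """Placeholder only: divides the word evenly. This is NOT how MAUS works."""
    if not segments:
        raise ValueError(f"no segments given for word {word.label!r}")
    n = len(segments)
    step = (word.end - word.start) / n
    bounds = [word.start + i * step for i in range(n)] + [word.end]
    return [Interval(seg, bounds[i], bounds[i + 1]) for i, seg in enumerate(segments)]


def align_segments_within_words(
    words: List[Interval],
    pronunciations: Dict[str, List[str]],
    aligner: SegmentAligner,
) -> List[Interval]:
    """Second round: align segments word by word, inside fixed word boundaries."""
    result: List[Interval] = []
    for word in words:
        segments = aligner(word, pronunciations[word.label])
        for seg in segments:
            # Segments must never cross a manually corrected word boundary.
            if not (word.start <= seg.start <= seg.end <= word.end):
                raise ValueError(f"segment {seg.label!r} leaves word {word.label!r}")
        result.extend(segments)
    return result


# Illustrative data: two corrected word intervals with X-SAMPA segment labels.
corrected_words = [Interval("Maus", 0.0, 0.4), Interval("ist", 0.5, 0.8)]
pronunciations = {"Maus": ["m", "aU", "s"], "ist": ["I", "s", "t"]}
segments = align_segments_within_words(
    corrected_words, pronunciations, uniform_placeholder_aligner
)
```

Because each word is handled separately, an error in one word cannot shift the segments of its neighbours, which is why the manual correction of word boundaries pays off in the second round.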

The figure below illustrates sentence-level alignment (provided by the corpus creator), word-level alignment (created by MAUS and manually corrected by the DoReCo team), and segment alignments in X-SAMPA created by MAUS.

We are now using this pipeline to add word and segment alignments to all corpora donated to us so far (see Languages), and we will then apply it to future donations.