DoReCo kick-off meeting

Last week we held our scheduled DoReCo workshop in Berlin to officially kick off the project and to reflect on our experiences with processing corpora over the past six months. The workshop was attended by corpus creators of nearly twenty languages spoken on six continents, representatives of various language archives, and representatives of a number of related initiatives. It was a great opportunity to bring together many important voices from the fields of archival and corpus linguistics, language documentation, and typology to discuss the mission of the DoReCo project.

Discussions at the workshop were immensely helpful for clarifying how we can best make the DoReCo corpus available to the scientific community: for instance, which file formats to provide, how to standardize data output, and what metadata would be most helpful for future users of the DoReCo web portal.

We also discussed broader and sometimes thornier issues related to archiving and research, such as how best to version data that is stored in multiple places, how to deal with licensing and citations, and what kinds of research questions can be answered with a reference corpus consisting of many, but relatively small, corpora.

A number of new collaborations also began at the workshop, and we’re looking forward to working on these together. A big thank you to everyone who participated.

Forced Alignment using MAUS

We have now successfully set up a pipeline for forced segment alignment for our audio data in DoReCo using MAUS (Munich Automatic Segmentation), a service provided by our partner, the Bayerisches Archiv für Sprachsignale.

Forced alignment is a technique that automatically aligns segments (speech sounds, as given in transcription files) to acoustic events in a signal. Major sound classes such as vowels, glides, and fricatives each have their own acoustic fingerprint, which allows the MAUS software to identify them in a continuous flow of speech. MAUS’s forced alignment algorithm is trained on large sets of manually annotated data to learn which acoustic patterns correspond to different sounds such as [p], [s], and [a].

For DoReCo, we carry out two rounds of forced alignment: First, we send audio and transcription data that we received from corpus donors to MAUS to get (rough) word alignments. We then manually correct those word boundaries by listening through recordings and inspecting spectrograms. We use these corrected word boundaries as input for a second round of forced alignment, which gives us reliable segment alignments within words.
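The relationship between the two rounds can be illustrated with a small Python sketch. It is not MAUS output or DoReCo's actual tooling: the `Interval` class, the `group_segments_by_word` function, and all labels and time values are hypothetical and invented for illustration. The sketch only shows the property we expect after the second round, namely that every segment interval falls inside one of the manually corrected word intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labelled stretch of audio (hypothetical structure, times in seconds)."""
    label: str
    start: float
    end: float


def group_segments_by_word(words, segments, tolerance=1e-6):
    """Assign each segment to the word interval that fully contains it.

    Raises ValueError if a segment is not contained in any word, e.g.
    because it crosses a (corrected) word boundary.
    """
    grouped = [[] for _ in words]
    for seg in segments:
        for i, word in enumerate(words):
            inside = (seg.start >= word.start - tolerance
                      and seg.end <= word.end + tolerance)
            if inside:
                grouped[i].append(seg)
                break
        else:
            # No word interval contains this segment.
            raise ValueError(
                f"segment {seg.label!r} ({seg.start}-{seg.end}) "
                "is not contained in any word interval"
            )
    return list(zip(words, grouped))


# Invented example: two corrected word intervals and their segments.
words = [Interval("kata", 0.00, 0.40), Interval("si", 0.40, 0.62)]
segments = [
    Interval("k", 0.00, 0.08), Interval("a", 0.08, 0.20),
    Interval("t", 0.20, 0.28), Interval("a", 0.28, 0.40),
    Interval("s", 0.40, 0.52), Interval("i", 0.52, 0.62),
]

aligned = group_segments_by_word(words, segments)
for word, segs in aligned:
    print(word.label, [s.label for s in segs])
```

A check of this kind is one conceivable way to confirm that segment alignments stay consistent with the corrected word boundaries; the tolerance parameter is an assumption to absorb floating-point rounding in time stamps.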

The figure below illustrates sentence-level alignment (provided by the corpus creator), word-level alignment (created by MAUS and manually corrected by the DoReCo team), and segment alignments in X-SAMPA created by MAUS.

We are now using this pipeline to add word and segment alignments to all corpora donated to us so far (see Languages), and will then apply it to future donations.

Diverse morphologies

The latest additions to the DoReCo sample include languages with fascinating and complex morphologies (e.g. Urum, Movima, Lower Sorbian). We’re excited to investigate how this complexity will play out with respect to speech rate and information rate; see http://doreco.info/project