Last week DoReCo held its scheduled workshop in Berlin to officially kick off the project and to reflect on our experience of processing corpora over the past six months. The workshop was attended by the creators of corpora for nearly twenty languages spoken on six continents, by representatives of various language archives, and by representatives of a number of related initiatives. It was a great opportunity to bring together many important voices from the fields of archival and corpus linguistics, language documentation, and typology to discuss the mission of the DoReCo project.
Discussions at the workshop were immensely helpful for clarifying how we can
best make the DoReCo corpus available to the scientific community: for
instance, which kinds of file formats to provide to the community, how to
standardize data output, and what metadata would be most helpful for future
users of the DoReCo web portal.
We also discussed broader and sometimes thornier issues related to archiving and research, such as how best to version data that is stored in multiple places, how to deal with licensing and citations, and what kinds of research questions can be answered with a reference corpus consisting of many, but relatively small, corpora.
A number of new collaborations also began at the workshop, and
we’re looking forward to working on these together. A big thank you to everyone
We have now successfully set up a pipeline for forced segment alignment for our audio data in DoReCo using MAUS (Munich Automatic Segmentation), a service provided by our partner, the Bayerisches Archiv für Sprachsignale.
Forced alignment is a technique that automatically
aligns segments (speech sounds, as given in transcription files) to
acoustic events in a signal. Major sound classes such as vowels,
glides, and fricatives each have their own acoustic fingerprint by
which they can be identified in a continuous flow of speech by the
MAUS software. MAUS’s forced alignment algorithm is trained on
large sets of manually annotated data to learn which acoustic
patterns correspond to different sounds such as [p], [s], and [a].
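To make the idea concrete, here is a minimal sketch of forced alignment as a dynamic-programming search. It is not MAUS's actual algorithm, only a toy illustration of the general technique. It assumes that some acoustic model has already produced, for every short audio frame, a log-likelihood for each sound in the transcription. The function name `force_align` and the input layout are hypothetical.

```python
import math

def force_align(frame_scores, phones):
    """Toy forced alignment (illustrative only, not the MAUS algorithm).

    frame_scores: one dict per audio frame, mapping phone -> log-likelihood.
    phones: the transcribed phone sequence, in order.
    Returns a list of (phone, start_frame, end_frame), end exclusive.
    """
    n_frames, n_phones = len(frame_scores), len(phones)
    if n_phones == 0 or n_frames < n_phones:
        raise ValueError("need at least one frame per phone")

    neg_inf = -math.inf
    # best[t][j]: best score of any alignment that ends in phone j at frame t
    best = [[neg_inf] * n_phones for _ in range(n_frames)]
    # stayed[t][j]: True if that best path was already in phone j at frame t-1
    stayed = [[False] * n_phones for _ in range(n_frames)]
    best[0][0] = frame_scores[0][phones[0]]

    for t in range(1, n_frames):
        for j in range(n_phones):
            same = best[t - 1][j]
            prev = best[t - 1][j - 1] if j > 0 else neg_inf
            if same == neg_inf and prev == neg_inf:
                continue  # phone j cannot be reached by frame t
            stayed[t][j] = same >= prev
            best[t][j] = max(same, prev) + frame_scores[t][phones[j]]

    # Trace back from the last frame of the last phone to recover boundaries.
    segments = []
    j, end = n_phones - 1, n_frames
    for t in range(n_frames - 1, 0, -1):
        if not stayed[t][j]:
            segments.append((phones[j], t, end))
            end = t
            j -= 1
    segments.append((phones[0], 0, end))
    segments.reverse()
    return segments
```

Frame indices can be turned into times by multiplying with the frame shift of the acoustic analysis (for example 10 ms, again an assumption for illustration).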
For DoReCo, we carry out two rounds of forced
alignment: First, we send audio and transcription data that we
received from corpus donors to MAUS to get (rough) word alignments.
We then manually correct those word boundaries by listening through
recordings and inspecting spectrograms. We use these corrected word
boundaries as input for a second round of forced alignment, which
gives us reliable segment alignments within words.
The figure below illustrates sentence-level alignment (provided by the corpus
creator), word-level alignment (created by MAUS and manually
corrected by the DoReCo team), and segment alignments in X-SAMPA
created by MAUS.
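The three levels described above (sentence, word, segment) form a nested hierarchy of time intervals. As a rough sketch of how that nesting could be checked after the second alignment round, the snippet below groups segment intervals under the word that contains them and reports any segment that crosses a word boundary. The `Interval` class, the `nest` function, and the example labels and times are all hypothetical. They are made up for illustration and are not taken from a DoReCo corpus or from MAUS output.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    label: str
    start: float  # seconds
    end: float    # seconds

def nest(parents, children, tol=1e-6):
    """Group each child interval under the parent interval that contains it.

    Returns (grouped, orphans): grouped maps each parent to its children in
    order; orphans lists children that fit inside no parent.
    """
    grouped = {p: [] for p in parents}
    orphans = []
    for c in children:
        for p in parents:
            if p.start - tol <= c.start and c.end <= p.end + tol:
                grouped[p].append(c)
                break
        else:
            orphans.append(c)
    return grouped, orphans

# Invented example: two words and their segments.
words = [Interval("kata", 0.10, 0.52), Interval("mi", 0.52, 0.80)]
segments = [
    Interval("k", 0.10, 0.18), Interval("a", 0.18, 0.30),
    Interval("t", 0.30, 0.38), Interval("a", 0.38, 0.52),
    Interval("m", 0.52, 0.63), Interval("i", 0.63, 0.80),
]
grouped, orphans = nest(words, segments)
```

An empty `orphans` list means every segment lies inside a corrected word boundary. The same function could be applied one level up, with sentences as parents and words as children.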
We are now using this pipeline to add word and segment alignments to all corpora donated to us so far (see Languages), and will then apply it to future donations.
The latest additions to the DoReCo sample include languages with fascinating and complex morphologies (e.g. Urum, Movima, Lower Sorbian) – we’re excited to investigate how this complexity will play out w.r.t. speech rate and information rate; see http://doreco.info/project
Kicking off DoReCo – our new cross-linguistic corpus project. Teams in Berlin & Lyon are now complete (http://doreco.info/team/). We’ve already received corpora from 20 languages (http://doreco.info/languages/). Preparing them for automatic time-alignment. Stay tuned for updates.