Blog

Congrats Asst Prof Easterday


It’s official: (Former) DoReCo project member Shelece Easterday will be an assistant professor at the University of Hawai’i. Congratulations, Shelece! We’re looking forward to cooperating with you at U Hawai’i on corpus-based, cross-linguistic studies on, e.g., phonological complexity.

50 languages!

Early in the second project year, we have now received data sets from more than 50 languages (see http://doreco.info/languages/). These data sets are currently at various stages of processing, but we have already fully processed and created alignments at the word and segment levels for the following five languages: Arapaho, Kamas, Svan, Urum, and Yongning Na. As the number of fully processed corpora grows, several exciting phonetic and morphological studies are already under way, building on the research ideas described in http://doreco.info/project/. Stay tuned for more info!

DoReCo workflow @ LREC

We are proud to announce our latest publication, in which we describe in detail DoReCo’s data processing workflow:
Paschen, Ludger, François Delafontaine, Christoph Draxler, Susanne Fuchs, Matthew Stave & Frank Seifart (2020). Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo). Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2657–2666. http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.324.pdf
For a list of all DoReCo publications, see http://doreco.info/publications/

PostDoc opportunity

DoReCo’s sister project QUEST in Berlin is looking for a PostDoc to work on optimizing fieldwork data for cross-linguistic research. We’re open to candidates proposing their own cross-linguistic, cross-corpus research questions for exploratory projects, using, e.g., DoReCo data. Check out the details at https://cutt.ly/8yGespq

PhD opportunities

Our host institution in Berlin, Leibniz-ZAS, currently invites applications for PhD positions (deadline: 15.3.2020): https://tinyurl.com/rou29a8. One possibility is a PhD project that exploits and further develops DoReCo. Potential applicants are welcome to contact Frank Seifart before applying. Spread the word among your students and colleagues!

Upcoming conference presentations

Come see DoReCo present our work at three venues over the next three months! On November 28-29 we will be presenting at the GDR-LIFT kick-off meeting in Orléans, France (https://gdr-lift.loria.fr/lift-kick-off-meeting-november-28-29-orleans/). On December 13 we will be at the Workshop on Rate and Rhythm in Speech Recognition in Nijmegen, Netherlands (https://www.mpi.nl/events/rate-and-rhythm-speech-recognition-r3). And on January 2-5 we will be presenting at the annual meeting of the Linguistic Society of America in New Orleans, USA (https://www.linguisticsociety.org/event/lsa-2020-annual-meeting).

We will be reporting, among other things, on our work with the MAUS system for phonemic time-alignment, developed by our project partners in Munich. Currently, the DoReCo corpus contains data from 40 languages, 20 of which have already been time-aligned, with many more on the way. If you’ll be at any of these three events, talk to us to find out more!

DoReCo kick-off meeting

Last week, the DoReCo team held our scheduled workshop in Berlin to officially kick off the project and to reflect on our experiences with processing corpora over the past six months. The workshop was attended by corpus creators of nearly twenty languages spoken on six continents, by representatives of various language archives, and by representatives of a number of related initiatives. It was a great opportunity to bring together many important voices from the fields of archival and corpus linguistics, language documentation, and typology to discuss the mission of the DoReCo project.

Discussions at the workshop were immensely helpful for clarifying how we can best make the DoReCo corpus available to the scientific community: for instance, which kinds of file formats to provide to the community, how to standardize data output, and what metadata would be most helpful for future users of the DoReCo web portal.

We also discussed broader and sometimes thornier issues related to archiving and research, such as how best to version data that is stored in multiple places, how to deal with licensing and citations, and what kinds of research questions can be answered with a reference corpus consisting of many, but relatively small, corpora.

A number of new collaborations also began at the workshop, and we’re looking forward to working on these together. A big thank you to everyone who participated.

Forced Alignment using MAUS

We have now successfully set up a pipeline for forced segment alignment for our audio data in DoReCo using MAUS (Munich Automatic Segmentation), a service provided by our partner, the Bayerisches Archiv für Sprachsignale.

Forced alignment is a technique that automatically aligns segments (speech sounds, as given in transcription files) to acoustic events in a signal. Major sound classes such as vowels, glides, and fricatives each have their own acoustic fingerprint by which they can be identified in a continuous flow of speech by the MAUS software. MAUS’s forced alignment algorithm is trained on large sets of manually annotated data to learn which acoustic patterns correspond to different sounds such as [p], [s], and [a].
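To make the idea concrete, here is a toy sketch of forced alignment by dynamic programming. It is not MAUS's actual implementation: the function name `force_align` and the per-frame score dictionaries are assumptions made for illustration, standing in for the acoustic model scores a trained system would supply. Given a known sequence of segments and a score for each label at each audio frame, it finds the segment boundaries that maximize the total score while keeping the segments in order.

```python
from typing import Dict, List, Tuple


def force_align(
    frame_scores: List[Dict[str, float]], segments: List[str]
) -> List[Tuple[str, int, int]]:
    """Return (label, first_frame, end_frame_exclusive) for each segment.

    frame_scores[t][label] is a hypothetical score (e.g. a log-likelihood)
    for hearing `label` at frame t. Every segment gets at least one frame.
    """
    n_frames, n_segs = len(frame_scores), len(segments)
    if n_segs == 0 or n_frames < n_segs:
        raise ValueError("need at least one frame per segment")

    neg_inf = float("-inf")
    # best[t][s]: best total score of frames 0..t with frame t in segment s
    best = [[neg_inf] * n_segs for _ in range(n_frames)]
    # starts_here[t][s]: True if frame t is the first frame of segment s
    starts_here = [[False] * n_segs for _ in range(n_frames)]
    best[0][0] = frame_scores[0][segments[0]]

    for t in range(1, n_frames):
        for s in range(n_segs):
            stay = best[t - 1][s]
            advance = best[t - 1][s - 1] if s > 0 else neg_inf
            previous = max(stay, advance)
            if previous == neg_inf:
                continue  # segment s cannot be reached by frame t
            best[t][s] = previous + frame_scores[t][segments[s]]
            starts_here[t][s] = advance > stay

    # Walk back from the last frame of the last segment to recover boundaries.
    spans = []
    s, end = n_segs - 1, n_frames
    for t in range(n_frames - 1, 0, -1):
        if starts_here[t][s]:
            spans.append((segments[s], t, end))
            end = t
            s -= 1
    spans.append((segments[s], 0, end))
    spans.reverse()
    return spans
```

With six frames whose scores favor [p], [p], [a], [a], [a], [s] in that order, aligning the transcription `["p", "a", "s"]` places the boundaries after frames 2 and 5. Real systems work on acoustic features rather than hand-written scores, but the constraint is the same: the transcription fixes which sounds occur and in what order, and the algorithm only decides where each one starts and ends.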

For DoReCo, we carry out two rounds of forced alignment: First, we send audio and transcription data that we received from corpus donors to MAUS to get (rough) word alignments. We then manually correct those word boundaries by listening through recordings and inspecting spectrograms. We use these corrected word boundaries as input for a second round of forced alignment, which gives us reliable segment alignments within words.
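The two-round procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the DoReCo pipeline itself: `split_evenly` is a hypothetical stand-in for the calls to MAUS, `correct_words` stands in for the manual correction step, and the `Interval` class and the pronunciation dictionary are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Interval:
    label: str
    start: float  # seconds
    end: float    # seconds


def split_evenly(span: Interval, labels: List[str]) -> List[Interval]:
    """Hypothetical stand-in for an aligner: divide a span evenly among labels."""
    step = (span.end - span.start) / len(labels)
    return [
        Interval(label, span.start + i * step, span.start + (i + 1) * step)
        for i, label in enumerate(labels)
    ]


def two_round_alignment(
    utterance: Interval,
    words: List[str],
    pronunciations: Dict[str, List[str]],
    correct_words: Callable[[List[Interval]], List[Interval]],
) -> Tuple[List[Interval], List[Interval]]:
    # Round 1: rough word boundaries inside the donor's utterance interval.
    rough_words = split_evenly(utterance, words)
    # Manual step: a person adjusts the word boundaries.
    checked_words = correct_words(rough_words)
    # Round 2: segments are aligned inside each corrected word, so a segment
    # can never cross a word boundary that a person has verified.
    segments: List[Interval] = []
    for word in checked_words:
        segments.extend(split_evenly(word, pronunciations[word.label]))
    return checked_words, segments


def move_first_boundary(rough: List[Interval]) -> List[Interval]:
    """Simulated manual correction: shift the first word boundary to 0.4 s."""
    fixed = [Interval(w.label, w.start, w.end) for w in rough]
    fixed[0].end = 0.4
    fixed[1].start = 0.4
    return fixed


words, segments = two_round_alignment(
    Interval("pa sa", 0.0, 1.0),
    ["pa", "sa"],
    {"pa": ["p", "a"], "sa": ["s", "a"]},
    move_first_boundary,
)
```

The point of the design is that the second round is constrained by the corrected word boundaries: errors in the first automatic pass are fixed once, at the word level, and do not propagate into the segment tier.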

The figure below illustrates sentence-level alignment (provided by the corpus creator), word-level alignment (created by MAUS and manually corrected by the DoReCo team), and segment alignments in X-SAMPA created by MAUS.
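The three levels form a hierarchy: segments sit inside words, and words sit inside sentences. As a rough sketch of that structure, the snippet below represents each tier as a list of labeled time spans and checks the nesting. The `Span` type, the function `is_nested`, and the example labels and times are all invented for illustration and are not taken from a DoReCo corpus.

```python
from typing import List, Tuple

Span = Tuple[str, float, float]  # (label, start, end), times in seconds


def is_nested(parents: List[Span], children: List[Span], tol: float = 1e-6) -> bool:
    """True if every child span lies inside some parent span."""
    return all(
        any(
            p_start - tol <= c_start and c_end <= p_end + tol
            for _, p_start, p_end in parents
        )
        for _, c_start, c_end in children
    )


# Invented example: one sentence, two words, four X-SAMPA segments.
sentence_tier: List[Span] = [("pa sa", 0.0, 1.0)]
word_tier: List[Span] = [("pa", 0.0, 0.4), ("sa", 0.4, 1.0)]
segment_tier: List[Span] = [
    ("p", 0.0, 0.1),
    ("a", 0.1, 0.4),
    ("s", 0.4, 0.7),
    ("a", 0.7, 1.0),
]

words_ok = is_nested(sentence_tier, word_tier)
segments_ok = is_nested(word_tier, segment_tier)
```

A check like this is a cheap way to catch a segment that straddles a word boundary before the data is used downstream.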

We are now using this pipeline to add word and segment alignments to all corpora donated to us so far (see Languages), and will then apply it to future donations.