Blog – DoReCo

DoReCo 2.0 published

We’re thrilled to announce the publication of DoReCo 2.0 today! This release features two new languages, Totoli (Sulawesi, Indonesia) and Gurindji (Northern Australia). It also brings significant improvements to the consistency of annotations, particularly in morpheme boundaries.

A heartfelt thank you (again) to all 53 teams of corpus creators for their tremendous work and for trusting us with processing and sharing their data.

With this release, we have also streamlined the governance of the DoReCo database, with Frank Seifart now serving as managing director in addition to editor. We continue to proudly include Ludger Paschen and Matt Stave as co-editors in recognition of their essential role in implementing the project plan as PostDocs in the DoReCo project from 2019 to 2022.

DoReCo as a CLDF (Cross-Linguistic Data Format) dataset

Good news for those doing quantitative analyses with DoReCo data: Most DoReCo data are now also distributed as a Cross-Linguistic Data Format (CLDF) dataset (https://github.com/cldf-datasets/doreco/releases), thanks to Robert Forkel. This format made it possible to analyze the more than 2,000,000 time-aligned phones in DoReCo in an efficient and reproducible way in the recent Nature Human Behaviour paper. Additionally, a pipeline is provided (https://github.com/cldf-datasets/doreco/blob/main/USAGE.md) to easily create CLDF versions of those DoReCo datasets that cannot be distributed in modified form due to license restrictions, as well as of datasets from future DoReCo releases.
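To give a rough idea of what working with time-aligned phones in a tabular format can look like, here is a minimal sketch in plain Python. CLDF datasets consist of CSV tables described by JSON metadata, so a phones table can in principle be read with any CSV reader. The column names (`Start`, `End`, `Word_Initial`) and the four sample rows are invented for illustration; they are assumptions, not the actual DoReCo CLDF schema or real measurements, so please consult the USAGE.md linked above for the real table layout.

```python
import csv
import io
from statistics import mean

# Toy excerpt of a phones table. Column names and values are made up
# for illustration and do NOT reflect the actual DoReCo CLDF schema.
SAMPLE = """\
ID,Name,Start,End,Word_Initial
p1,t,0.000,0.085,yes
p2,a,0.085,0.190,no
p3,k,0.190,0.250,no
p4,m,0.250,0.340,yes
"""


def mean_durations(csv_file):
    """Return the mean phone duration in seconds per Word_Initial value."""
    durations = {}
    for row in csv.DictReader(csv_file):
        # Duration is the difference between the end and start time stamps.
        dur = float(row["End"]) - float(row["Start"])
        durations.setdefault(row["Word_Initial"], []).append(dur)
    return {flag: mean(values) for flag, values in durations.items()}


result = mean_durations(io.StringIO(SAMPLE))
for flag, value in sorted(result.items()):
    print(f"word-initial={flag}: {value:.4f} s")
```

With a real dataset, the `io.StringIO(SAMPLE)` argument would be replaced by an opened CSV file from the downloaded release, and the column names adjusted to whatever the dataset's metadata specifies.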

Word-initial consonant lengthening in 51 DoReCo languages

The first study using all 51 DoReCo languages, by Frederic Blum et al., was just published in Nature Human Behaviour at https://www.nature.com/articles/s41562-024-01988-4. It shows that word-initial consonants tend to be slightly lengthened – probably to support identification of word boundaries in the continuous stream of speech. It also once again shows the importance of taking diverse languages into account when studying human language and speech. Thanks again to all DoReCo contributors for making this possible!

DoReCo 1.1

We have significantly improved the consistency and accuracy of DoReCo data over the past couple of weeks, so we decided to share the results as DoReCo 1.1, which has been published today. Details on what has been changed can be found on the DoReCo GitHub issue tracker.

DoReCo database published

The DoReCo database is online! A huge THANKS to all corpus creators and project members for the enormous efforts in building it over the past 3.5 years and for making the inauguration event last week a big success. The DoReCo project is ending soon, but we’ll keep this website online for a while to inform you about publications resulting from DoReCo and other news.

29 July 2022 DoReCo Inauguration Ceremony

After three exciting and intense years of corpus-building, we are thrilled to announce the upcoming inauguration of the complete DoReCo database, with fully processed data sets on all 50 (plus one!) DoReCo languages. To mark this occasion, we invite everyone to a public event on 29 July 2022, 3:30-5:30pm CEST, to be held at the ZAS in Berlin and online. We are looking forward to a keynote address by Evangelia Adamou and we are also very happy that many of the DoReCo corpus contributors will be present to introduce themselves and the languages they work on. The program is as follows:
• 3:00 PM Admission (in person or online Zoom room)
• 3:30 PM Welcome by Manfred Krifka, statements by DoReCo PIs and Postdocs
• 4:00 PM Keynote by Evangelia Adamou
• 4:30 PM DoReCo contributors introduce themselves
• 5:15 PM DoReCo database goes online
• 5:30 PM Reception
Online or on-site attendance is free, but registration (by 22 July) is required:
Direct link to registration

DoReCo is going to Texas!

DoReCo members and friends are organizing a workshop on “Spoken- and Signed-language Corpus Studies in Linguistic Typology” at the 14th International Conference of the Association for Linguistic Typology, to be held at the University of Texas at Austin, USA, on 15-17 December, 2022 (see https://sites.google.com/view/alt2022/workshops). Submit your abstract by April 1 and come discuss with us!

Documentary linguistics and corpus phonetics now happily married

An edited volume on “Corpus-Based Typology With Spoken Language Corpora” just appeared (diamond open access) with a contribution by DoReCo PI Frank Seifart on “Combining documentary linguistics and corpus phonetics to advance corpus-based typology”, arguing that documentary linguistics and corpus phonetics form a happy marriage, an example being – you guessed it – DoReCo! Check out the other excellent contributions, too.

Nikolaus Himmelmann Abralin ao vivo talk

Our advisory board member Nikolaus Himmelmann, who’s also a founding father of Documentary Linguistics and pioneer of corpus-based prosodic typology, will give a presentation on “Universals of Language 3.0” at Abralin ao vivo on Wednesday 12.01.2022, 5:00 PM (UTC) / 6:00 PM (a recording of the talk will also be available anytime after that on the website). Highly recommended! https://aovivo.abralin.org/lives/nikolaus-p-himmelmann/

Ludger Paschen replacement professorship

Congratulations to DoReCo PostDoc Ludger Paschen on his appointment to a six-month replacement professorship in Slavic linguistics at the University of Potsdam. We wish him lots of success there! We will dearly miss him until his return to DoReCo in March 2022.