Talko – Speech Corpus of Swedish in Finland
Talko is a speech corpus of Swedish in Finland. It consists of audio files linked to annotations: transcriptions at two parallel levels and part-of-speech (POS) tags.
The corpus consists of sociolinguistic interviews recorded in all parts of Swedish-language Finland.
Most of the material in the corpus consists of recordings from the project Spara det finlandssvenska talet, which was carried out between 2005 and 2008. Speakers from two age groups (20–30 years and 55–75 years), both male and female, were recorded in both rural and urban areas. The interviews generally lasted 40–60 minutes, but 20-minute excerpts have been selected for the corpus.
The corpus also contains 28 shorter interviews recorded between 1959 and 1987 from the publication (book and CD) Från Pyttis till Nedervetil.
The recordings have been transcribed both in a broad phonetic transcription and in a standard orthographic transcription. The orthographic transcription is subsequently POS-tagged.
For a more thorough overview (in Swedish) of the different versions of Talko, with their token counts and durations in hours, please see the Swedish website. The current version of the corpus is Talko 2.0, dated March 9, 2017.
The POS tagging is done with TreeTagger trained on the Stockholm-Umeå Corpus of written Swedish as well as on some manually corrected Talko data. Training on the corrected Talko data has gradually improved the results of the automatic tagging and compensates for differences between spoken and written Swedish and between Finland-Swedish and Sweden-Swedish.
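As an illustration of how tagged output of this kind might be inspected, the following is a minimal sketch that counts POS tags in tab-separated lines of the form token, tag, lemma, which is a layout TreeTagger commonly produces. It assumes that layout; the actual distribution format of Talko may differ. The function names and the sample lines are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def parse_tagged_lines(lines):
    """Yield (token, tag, lemma) tuples from tab-separated lines.

    Assumes one token per line in the order token, tag, lemma.
    The lemma column is optional; it is returned as None when missing.
    Blank lines and lines without a tag column are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue
        token, tag = parts[0], parts[1]
        lemma = parts[2] if len(parts) > 2 else None
        yield token, tag, lemma


def tag_frequencies(lines):
    """Count how often each POS tag occurs in the tagged lines."""
    return Counter(tag for _, tag, _ in parse_tagged_lines(lines))


# Invented sample for illustration only (not Talko data).
sample = [
    "jag\tPN\tjag",
    "bor\tVB\tbo",
    "i\tPP\ti",
    "Helsingfors\tPM\tHelsingfors",
]

freqs = tag_frequencies(sample)
for tag, count in freqs.most_common():
    print(tag, count)
```

Comparing such tag counts before and after retraining on manually corrected data is one simple way to see where the tagger's behaviour has changed.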
Södergård, Lisa & Therese Leinonen, 2017. Talko – korpus över den talade svenskan i Finland: Korpusbygge i teori och praktik. In Ideologi, identitet, intervention. Nordisk dialektologi 10 (pp. 331–340), edited by Jan-Ola Östman, Caroline Sandström, Pamela Gustavsson and Lisa Södergård. Helsinki: Department of Finnish, Finno-Ugrian and Scandinavian Studies at the University of Helsinki.