Proceedings Article

Experiences from the Spoken Dutch Corpus Project

TL;DR: LREC 2002, the Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, 29 May 2002.
Abstract: LREC 2002. Third International Conference on Language Resources and Evaluation. Las Palmas de Gran Canaria, 29 May 2002.


Citations
Proceedings Article
01 Oct 2007
TL;DR: A global analysis of the TADPOLE system shows that it is able to process text in linear time close to an estimated 2,500 words per second, while maintaining sufficient accuracy.
Abstract: We describe TADPOLE, a modular memory-based morphosyntactic tagger and dependency parser for Dutch. Though primarily aimed at being accurate, the design of the system is also driven by optimizing speed and memory usage, using a trie-based approximation of k-nearest neighbor classification as the basis of each module. We perform an evaluation of its three main modules: a part-of-speech tagger, a morphological analyzer, and a dependency parser, trained on manually annotated material available for Dutch – the parser is additionally trained on automatically parsed data. A global analysis of the system shows that it is able to process text in linear time close to an estimated 2,500 words per second, while maintaining sufficient accuracy.

196 citations


Cites background from "Experiences from the Spoken Dutch C..."

  • ...…Dutch tagged with the Spoken Dutch Corpus tagset (Van Eynde 2004): The approximately nine-million word of the transcribed Spoken Dutch Corpus itself (Oostdijk et al. 2002), the ILK corpus with approximately 46 thousand part-of-speech tagged words, the D-Coi corpus with approximately 330 thousand…...

    [...]

Journal ArticleDOI
TL;DR: This paper investigated how individual differences in linguistic knowledge and processing skills relate to individual differences in speaking fluency and found that linguistic skills were most strongly related to average syllable duration, of which 50% of individual variance was explained.
Abstract: This study investigated how individual differences in linguistic knowledge and processing skills relate to individual differences in speaking fluency. Speakers of Dutch as a second language (N = 179) performed eight speaking tasks, from which several measures of fluency were derived such as measures for pausing, repairing, and speed (mean syllable duration). In addition, participants performed separate tasks, designed to gauge individuals’ second language linguistic knowledge and linguistic processing speed. The results showed that the linguistic skills were most strongly related to average syllable duration, of which 50% of individual variance was explained; in contrast, average pausing duration was only weakly related to linguistic knowledge and processing skills.

152 citations


Cites background or methods from "Experiences from the Spoken Dutch C..."

  • ...The names belonged to the 2200 most frequent lemmas in the CGN (Oostdijk et al., 2002)....

    [...]

  • ...For Part 1, 9 words were selected from each frequency band of 1000 words between words ranked 1 to 10,000 according to the Corpus of Spoken Dutch (CGN; Oostdijk et al., 2002)....

    [...]

Journal ArticleDOI
TL;DR: The study broadens research on the role of frequency in speech production to voice assimilation: clusters from a corpus of read speech were more often perceived as unassimilated in lower-frequency words and as either completely voiced (regressive assimilation) or, unexpectedly, as completely voiceless (progressive assimilation) in higher-frequency words.
Abstract: Acoustic duration and degree of vowel reduction are known to correlate with a word’s frequency of occurrence. The present study broadens the research on the role of frequency in speech production to voice assimilation. The test case was regressive voice assimilation in Dutch. Clusters from a corpus of read speech were more often perceived as unassimilated in lower-frequency words and as either completely voiced (regressive assimilation) or, unexpectedly, as completely voiceless (progressive assimilation) in higher-frequency words. Frequency did not predict the voice classifications over and above important acoustic cues to voicing, suggesting that the frequency effects on the classifications were carried exclusively by the acoustic signal. The duration of the cluster and the period of glottal vibration during the cluster decreased while the duration of the release noises increased with frequency. This indicates that speakers reduce articulatory effort for higher-frequency words, with some acoustic cues signaling more voicing and others less voicing. A higher frequency leads not only to acoustic reduction but also to more assimilation.

70 citations

Journal ArticleDOI
TL;DR: The phoneme intelligibility scores of dysarthric speakers obtained by the three investigated intelligibility model types are reliable and the intelligibility scoring system is now ready to be implemented in a clinical tool.
Abstract: Background: Currently, clinicians mainly rely on perceptual judgements to assess intelligibility of dysarthric speech. Although often highly reliable, this procedure is subjective with a lot of intrinsic variables. Therefore, certain benefits can be expected from a speech technology‐based intelligibility assessment. Previous attempts to develop an automated intelligibility assessment mainly relied on automatic speech recognition (ASR) systems that were trained to recognize the speech of persons without known impairments. In this paper automatic speech alignment (ASA) systems are used instead. In addition, previous attempts only made use of phonemic features (PMF). However, since articulation is an important contributing factor to intelligibility of dysarthric speech and since phonological features (PLF) are shared by multiple phonemes, phonological features may be more appropriate to characterize and identify dysarthric phonemes. Aims: To investigate the reliability of objective phoneme intelligibility sco...

67 citations


Cites methods from "Experiences from the Spoken Dutch C..."

  • ...The systems used in this study were trained on the read speech parts of the Spoken Dutch Corpus (CGN) (Oostdijk et al. 2002) and the CoGeN corpus (Demuynck et al. 1997)....

    [...]

Journal ArticleDOI
TL;DR: Overall, it is found that reduction is more pervasive in spontaneous Dutch than previously documented.

63 citations


Cites background or methods from "Experiences from the Spoken Dutch C..."

  • ...The CGN corpus (Oostdijk et al., 2002) does not contain sufficient speech data to train acoustic models for these sounds....

    [...]

  • ...We transformed the transcriptions to the standards developed in the CGN project (Oostdijk et al., 2002)....

    [...]

  • ...The acoustic phone models used for all alignments presented here were 37 32-Gaussian tristate monophone acoustic models (Hämäläinen, Gubian, ten Bosch, & Boves, 2009) that had been trained on 396,187 word tokens of the Dutch Library of the Blind of the Spoken Dutch Corpus (CGN, Oostdijk et al., 2002)....

    [...]

  • ...It was compiled by merging lexical resources such as CELEX (Baayen, Piepenbrock, & Gulikers, 1995), RBN (van der Vliet, 2007) and CGN (Oostdijk et al., 2002)....

    [...]

References
Journal ArticleDOI
Jacob Cohen
TL;DR: The article considers judgments made on a nominal scale, where it is important to determine the extent to which the judgments are reproducible, i.e., reliable, and presents a procedure in which two or more judges independently categorize a sample of units.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and

34,965 citations

Book
01 Jan 1994

861 citations


"Experiences from the Spoken Dutch C..." refers methods in this paper

  • ...The design of the headers has been inspired by the guidelines of the Text Encoding Initiative (Sperberg-McQueen and Burnard, 1994) and the Corpus Encoding Standard (Ide, 1996)....

    [...]

Book
01 Jan 1998
TL;DR: This textbook is designed to provide a detailed understanding of the principles and practices underlying the use of large language corpora in exploratory learning and English language teaching and research, focusing on the British National Corpus and the search tool SARA (SGML Aware Retrieval Application).
Abstract: This textbook is designed to provide a detailed understanding of the principles and practices underlying the use of large language corpora in exploratory learning and English language teaching and research. It focuses on the largest and most representative corpus of spoken and written data yet compiled - the British National Corpus - and on the search tool SARA (SGML Aware Retrieval Application). The method adopted is to provide a graded series of exercises, each introducing at the same time new features of the software and new techniques or applications for computer-assisted language learning. The book also includes an overview of previous work in corpus linguistics, a bibliography, and a reference manual for the SARA software. * Graded self-paced tutorials * Suggestions for further work * Thorough coverage of corpus linguistics theories and practices * State-of-the-art software * Accessible non-specialist style

394 citations

Book
01 Jan 1984

367 citations


"Experiences from the Spoken Dutch C..." refers methods in this paper

  • ...For the part-of-speech distinction we employ the classical classification into ten parts of speech, which is also used in the standard reference grammar for Dutch Algemene Nederlandse Spraakkunst (ANS; Haeseryn et al., 1997)....

    [...]

Proceedings Article
01 May 2000
TL;DR: The Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch, a valuable resource for research in the fields of computational linguistics and language and speech technology.
Abstract: In this paper the Spoken Dutch Corpus project is presented, a joint Flemish-Dutch undertaking aimed at the compilation and annotation of a 10-million-word corpus of spoken Dutch. Upon completion, the corpus will constitute a valuable resource for research in the fields of computational linguistics and language and speech technology. The paper first gives an overall description of the project, its aims, structure and organization. It then goes on to discuss the considerations, both methodological and practical, that have played a role in the design of the corpus as well as in its compilation and annotation. The paper concludes with an account of the data that are available in the first release of the first part of the corpus that came out on March 1st, 2000.

244 citations