Author

Tanja Säily

Bio: Tanja Säily is an academic researcher at the University of Helsinki. Her research focuses on corpus linguistics and language change. She has an h-index of 11 and has co-authored 42 publications receiving 414 citations.

Papers
Journal ArticleDOI
TL;DR: Compares the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus, and concludes that significance testing can be used to find consequential differences between corpora.
Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
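
To make the recommended text-level approach concrete, here is a minimal Python sketch that compares per-text frequencies of a single word between two corpora. The corpus variables, the example word, and the pooled-resampling bootstrap are illustrative assumptions, not the authors' implementation.

```python
import random
from scipy.stats import ttest_ind, mannwhitneyu

def per_text_freqs(corpus, word):
    """Relative frequency of `word` in each text, per 1,000 tokens."""
    return [1000 * text.count(word) / len(text) for text in corpus]

def bootstrap_p(a, b, n_iter=10_000, seed=0):
    """Two-sided bootstrap test on the difference of means,
    resampling from the pooled per-text frequencies under the null."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        x = [rng.choice(pooled) for _ in a]
        y = [rng.choice(pooled) for _ in b]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_iter

# corpus_f, corpus_m: lists of tokenised texts (each text a list of words)
# f = per_text_freqs(corpus_f, "lovely")
# m = per_text_freqs(corpus_m, "lovely")
# print(ttest_ind(f, m, equal_var=False))  # Welch's t-test on per-text frequencies
# print(mannwhitneyu(f, m))                # Mann-Whitney U = Wilcoxon rank-sum
# print(bootstrap_p(f, m))                 # bootstrap p-value
```

Because the unit of observation is the text rather than the token, poorly dispersed words (frequent in a handful of texts) no longer inflate significance the way they do under token-level independence.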

86 citations

Journal ArticleDOI
TL;DR: It is discovered that the category-conditioned degree of productivity P is unusable when comparing subcorpora based on social groups, and that hapax legomena remain a theoretically well-founded component of productivity measures.
Abstract: The first aim of this work is to examine gender-based variation in the productivity of the nominal suffixes -ness and -ity in present-day British English. Possible interpretations are presented for the findings that -ity is used less productively by women, while with -ness there is no gender difference. The second aim is to analyse the validity of hapax-based measures of productivity in sociolinguistic research. It is discovered that they require a significantly larger corpus than type-based ones, and that the category-conditioned degree of productivity P is unusable when comparing subcorpora based on social groups. Otherwise, hapax legomena remain a theoretically well-founded component of productivity measures.
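
As a rough illustration of the measures under discussion, the hedged sketch below computes token, type, and hapax counts for a suffix along with the category-conditioned degree of productivity P = V1/N. The naive endswith() morphology and the data format are assumptions for illustration only.

```python
from collections import Counter

def productivity(tokens, suffix):
    """Token count N, type count V, hapax count V1, and P = V1/N
    for words bearing `suffix` (crude string matching, not real
    morphological analysis)."""
    counts = Counter(t for t in tokens if t.endswith(suffix))
    n_tokens = sum(counts.values())                      # N: suffix tokens
    n_types = len(counts)                                # V: distinct types
    n_hapax = sum(1 for c in counts.values() if c == 1)  # V1: hapax legomena
    p = n_hapax / n_tokens if n_tokens else 0.0
    return {"N": n_tokens, "V": n_types, "V1": n_hapax, "P": p}

# e.g. compare productivity(women_tokens, "ness") with
#      productivity(men_tokens, "ness") across subcorpora
```

Since P divides by the number of suffix tokens N, which varies sharply across social subcorpora of different sizes, this is precisely the measure the study finds unusable for such comparisons.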

51 citations

Journal ArticleDOI
TL;DR: Considers the reliability of part-of-speech tagging in a diachronic corpus and shifts in tag ratios over time, both to make users of the corpus aware of potential problems and to obtain linguistically interesting results.
Abstract: Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus, and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415–1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.
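
A minimal sketch of the noun- and pronoun-ratio diagnostics, assuming a simplified data layout and tag names (the actual PCEEC tag set differs); it is meant only to show the shape of the computation.

```python
from collections import defaultdict

NOUN_TAGS = {"N", "NS", "NPR", "NPRS"}  # assumed tag names, for illustration
PRON_TAGS = {"PRO", "PRO$"}

def tag_ratios(letters):
    """letters: iterable of (year, [(word, tag), ...]) pairs.
    Returns {decade: (nouns per 1,000 tags, pronouns per 1,000 tags)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # decade -> [nouns, pronouns, all tags]
    for year, tagged in letters:
        decade = (year // 10) * 10
        for _word, tag in tagged:
            totals[decade][2] += 1
            if tag in NOUN_TAGS:
                totals[decade][0] += 1
            elif tag in PRON_TAGS:
                totals[decade][1] += 1
    return {d: (1000 * n / t, 1000 * p / t)
            for d, (n, p, t) in sorted(totals.items()) if t}
```

Splitting the same tallies by writer gender instead of decade would reproduce the study's second comparison, women's higher pronoun ratios versus men's higher noun ratios.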

29 citations

Book ChapterDOI
01 Jan 2009
TL;DR: Presents an open-source program that uses Monte Carlo sampling to compute upper and lower significance bounds for type and hapax accumulation curves, confirming the hypothesis that the productivity of -ity, as measured by type counts, is significantly low in letters written by women.
Abstract: This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
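
The following is a hedged re-creation of the core idea, not the authors' open-source program: Monte Carlo shuffles of the token stream yield pointwise bounds on the type accumulation curve, against which an observed subcorpus curve can be compared. Shuffling at the token level is a simplifying assumption; the paper works with corpus samples.

```python
import random

def type_accumulation(tokens):
    """Number of distinct types seen after each successive token."""
    seen, curve = set(), []
    for t in tokens:
        seen.add(t)
        curve.append(len(seen))
    return curve

def mc_bounds(tokens, n_samples=1000, alpha=0.05, seed=0):
    """Pointwise lower/upper bounds of the accumulation curve
    over random orderings of the full corpus."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_samples):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        curves.append(type_accumulation(shuffled))
    k = int(alpha / 2 * n_samples)
    lower, upper = [], []
    for point in zip(*curves):          # one position at a time
        s = sorted(point)
        lower.append(s[k])
        upper.append(s[-k - 1])
    return lower, upper
```

An observed curve for -ity types in women's letters falling below `lower` would indicate significantly low type richness at the chosen significance level, which is the shape of the result the paper reports.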

28 citations


Cited by
Journal ArticleDOI
01 Jan 2003

1,739 citations

Journal ArticleDOI
TL;DR: The author guides the reader in about 350 pages from descriptive and basic statistical methods through classification and clustering to (generalised) linear and mixed models, enabling researchers and students alike to reproduce the analyses and learn by doing.
Abstract: The complete title of this book runs 'Analyzing Linguistic Data: A Practical Introduction to Statistics using R', and as such it reflects the purpose and spirit of the book very well. The author guides the reader in about 350 pages from descriptive and basic statistical methods through classification and clustering to (generalised) linear and mixed models. Each of the methods is introduced in the context of concrete linguistic problems and demonstrated on exciting datasets from current research in the language sciences. In line with its practical orientation, the book focuses primarily on using the methods and interpreting the results. This implies that the mathematical treatment of the techniques is kept to a minimum, if not absent from the book. In return, the reader is provided with very detailed explanations of how to conduct the analyses using R [1].

The first chapter sets the tone, being a 20-page introduction to R. For this and all subsequent chapters, the R code is intertwined with the chapter text, and the datasets and functions used are conveniently packaged in the languageR package, which is available on the Comprehensive R Archive Network (CRAN). With this approach, the author has done an excellent job of enabling researchers and students alike to reproduce the analyses and learn by doing. Another quality as a textbook is the fact that every chapter ends with Workbook sections where the user is invited to exercise his or her analysis skills on supplemental datasets. Full solutions including code, results, and comments are given in Appendix A (30 pages). Instructors are therefore very well served by this text, although they might want to balance the book with some more mathematical treatment depending on the target audience.

After the introductory chapter on R, the book opens with graphical data exploration. Chapter 3 treats probability distributions and common sampling distributions. Under basic statistical methods (Chapter 4), distribution tests and tests on means and variances are covered. Chapter 5 deals with clustering and classification. Strangely enough, the clustering section has material on PCA, factor analysis, and correspondence analysis, and includes only one subsection on clustering, devoted notably to hierarchical partitioning methods. The classification part deals with decision trees, discriminant analysis, and support vector machines. The regression chapter (Chapter 6) treats linear models, generalised linear models, piecewise linear models, and a substantial section on models for lexical richness. The final chapter on mixed models is particularly interesting, as it is one of the few textbook accounts that introduce the reader to using the (innovative) lme4 package of Douglas Bates, which implements linear mixed-effects models. Moreover, the case studies included in this

1,679 citations

Journal ArticleDOI
TL;DR: In a conversational format, this article answers a few questions that corpus linguists regularly face from linguists who have not yet used corpus-based methods, and discusses some of the central assumptions, notions, and methods of corpus linguistics.
Abstract: Corpus linguistics is one of the fastest-growing methodologies in contemporary linguistics. In a conversational format, this article answers a few questions that corpus linguists regularly face from linguists who have not used corpus-based methods so far. It discusses some of the central assumptions (‘formal distributional differences reflect functional differences’), notions (corpora, representativity and balancedness, markup and annotation), and methods of corpus linguistics (frequency lists, concordances, collocations), and discusses a few ways in which the discipline still needs to mature. At a recent LSA meeting … [with an obvious bow to Frederick Newmeyer] Question: So, I hear you’re a corpus linguist. Interesting, I get to see more and more
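
For readers new to the methods named above, here is a toy Python sketch of two of them, a frequency list and a keyword-in-context (KWIC) concordance; the function names and window size are illustrative assumptions.

```python
from collections import Counter

def frequency_list(tokens, n=10):
    """The n most frequent word types with their counts."""
    return Counter(tokens).most_common(n)

def concordance(tokens, keyword, window=4):
    """Yield a KWIC line for each occurrence of `keyword`."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>40} [{tok}] {right}"
```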

463 citations

01 Jan 2001
TL;DR: This book presents models of word frequency distributions, covering non-parametric and parametric models, mixture distributions, the randomness assumption, and examples of applications.
Abstract: 1. Word Frequencies. 2. Non-parametric models. 3. Parametric models. 4. Mixture distributions. 5. The Randomness Assumption. 6. Examples of Applications. A. List of Symbols. B. Solutions of the exercises. C. Software. D. Data sets. Bibliography. Index.

422 citations