Author

Tanja Säily

Bio: Tanja Säily is an academic researcher at the University of Helsinki. Her research focuses on corpus linguistics and language change. She has an h-index of 11 and has co-authored 42 publications receiving 414 citations.

Papers
Journal ArticleDOI
TL;DR: Compares the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus, and concludes that significance testing can be used to find consequential differences between corpora.
Abstract: Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
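
To make the recommended text-level approach concrete, here is a minimal Python sketch that compares per-text frequencies of a single word between two corpora. The corpus variables, the example word, and the pooled-resampling bootstrap are illustrative assumptions, not the authors' implementation.

```python
import random
from scipy.stats import ttest_ind, mannwhitneyu

def per_text_freqs(corpus, word):
    """Relative frequency of `word` in each text, per 1,000 tokens."""
    return [1000 * text.count(word) / len(text) for text in corpus]

def bootstrap_p(a, b, n_iter=10_000, seed=0):
    """Two-sided bootstrap test on the difference of means,
    resampling from the pooled per-text frequencies under the null."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        x = [rng.choice(pooled) for _ in a]
        y = [rng.choice(pooled) for _ in b]
        if abs(sum(x) / len(x) - sum(y) / len(y)) >= observed:
            hits += 1
    return hits / n_iter

# corpus_f, corpus_m: lists of tokenised texts (each text a list of words)
# f = per_text_freqs(corpus_f, "lovely")
# m = per_text_freqs(corpus_m, "lovely")
# print(ttest_ind(f, m, equal_var=False))  # Welch's t-test on per-text frequencies
# print(mannwhitneyu(f, m))                # Mann-Whitney U = Wilcoxon rank-sum
# print(bootstrap_p(f, m))                 # bootstrap p-value
```

Because the unit of observation is the text rather than the token, poorly dispersed words (frequent in a handful of texts) no longer inflate significance the way they do under token-level independence.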

86 citations

Journal ArticleDOI
TL;DR: It is discovered that the category-conditioned degree of productivity P is unusable when comparing subcorpora based on social groups, and that hapax legomena remain a theoretically well-founded component of productivity measures.
Abstract: The first aim of this work is to examine gender-based variation in the productivity of the nominal suffixes -ness and -ity in present-day British English. Possible interpretations are presented for the findings that -ity is used less productively by women, while with -ness there is no gender difference. The second aim is to analyse the validity of hapax-based measures of productivity in sociolinguistic research. It is discovered that they require a significantly larger corpus than type-based ones, and that the category-conditioned degree of productivity P is unusable when comparing subcorpora based on social groups. Otherwise, hapax legomena remain a theoretically well-founded component of productivity measures.
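
As a rough illustration of the measures under discussion, the hedged sketch below computes token, type, and hapax counts for a suffix along with the category-conditioned degree of productivity P = V1/N. The naive endswith() morphology and the data format are assumptions for illustration only.

```python
from collections import Counter

def productivity(tokens, suffix):
    """Token count N, type count V, hapax count V1, and P = V1/N
    for words bearing `suffix` (crude string matching, not real
    morphological analysis)."""
    counts = Counter(t for t in tokens if t.endswith(suffix))
    n_tokens = sum(counts.values())                      # N: suffix tokens
    n_types = len(counts)                                # V: distinct types
    n_hapax = sum(1 for c in counts.values() if c == 1)  # V1: hapax legomena
    p = n_hapax / n_tokens if n_tokens else 0.0
    return {"N": n_tokens, "V": n_types, "V1": n_hapax, "P": p}

# e.g. compare productivity(women_tokens, "ness") with
#      productivity(men_tokens, "ness") across subcorpora
```

Since P divides by the number of suffix tokens N, which varies sharply across social subcorpora of different sizes, this is precisely the measure the study finds unusable for such comparisons.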

51 citations

Journal ArticleDOI
TL;DR: Considers the reliability of part-of-speech tagging in a diachronic corpus and shifts in tag ratios over time, both to make users of the corpus aware of potential problems and to obtain linguistically interesting results.
Abstract: Many corpus linguists make the tacit assumption that part-of-speech frequencies remain constant during the period of observation. In this article, we will consider two related issues: (1) the reliability of part-of-speech tagging in a diachronic corpus, and (2) shifts in tag ratios over time. The purpose is both to serve the users of the corpus by making them aware of potential problems, and to obtain linguistically interesting results. We use noun and pronoun ratios as diagnostics indicative of opposing stylistic tendencies, but we are also interested in testing whether any observed variation in the ratios could be accounted for in sociolinguistic terms. The material for our study is provided by the Parsed Corpus of Early English Correspondence (PCEEC), which consists of 2.2 million running words covering the period 1415–1681. The part-of-speech tagging of the PCEEC has its problems, which we test by reannotating the corpus according to our own principles and comparing the two annotations. While there are quite a few changes, the mean percentage of change is very small for both nouns and pronouns. As for variation over time, the mean frequency of nouns declines somewhat, while the mean frequency of pronouns fluctuates with no clear diachronic trend. However, women consistently use more pronouns than men, while men use more nouns than women. More fine-grained distinctions are needed to uncover further regularities and possible reasons for this variation.
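
A minimal sketch of the noun- and pronoun-ratio diagnostics, assuming a simplified data layout and tag names (the actual PCEEC tag set differs); it is meant only to show the shape of the computation.

```python
from collections import defaultdict

NOUN_TAGS = {"N", "NS", "NPR", "NPRS"}  # assumed tag names, for illustration
PRON_TAGS = {"PRO", "PRO$"}

def tag_ratios(letters):
    """letters: iterable of (year, [(word, tag), ...]) pairs.
    Returns {decade: (nouns per 1,000 tags, pronouns per 1,000 tags)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # decade -> [nouns, pronouns, all tags]
    for year, tagged in letters:
        decade = (year // 10) * 10
        for _word, tag in tagged:
            totals[decade][2] += 1
            if tag in NOUN_TAGS:
                totals[decade][0] += 1
            elif tag in PRON_TAGS:
                totals[decade][1] += 1
    return {d: (1000 * n / t, 1000 * p / t)
            for d, (n, p, t) in sorted(totals.items()) if t}
```

Splitting the same tallies by writer gender instead of decade would reproduce the study's second comparison, women's higher pronoun ratios versus men's higher noun ratios.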

29 citations

Book ChapterDOI
01 Jan 2009
TL;DR: Presents an open-source program that uses Monte Carlo sampling to compute upper and lower significance bounds for type and hapax accumulation curves, confirming the hypothesis that the productivity of -ity, as measured by type counts, is significantly low in letters written by women.
Abstract: This work is a case study of applying nonparametric statistical methods to corpus data. We show how to use ideas from permutation testing to answer linguistic questions related to morphological productivity and type richness. In particular, we study the use of the suffixes -ity and -ness in the 17th-century part of the Corpus of Early English Correspondence within the framework of historical sociolinguistics. Our hypothesis is that the productivity of -ity, as measured by type counts, is significantly low in letters written by women. To test such hypotheses, and to facilitate exploratory data analysis, we take the approach of computing accumulation curves for types and hapax legomena. We have developed an open source computer program which uses Monte Carlo sampling to compute the upper and lower bounds of these curves for one or more levels of statistical significance. By comparing the type accumulation from women’s letters with the bounds, we are able to confirm our hypothesis.
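
The following is a hedged re-creation of the core idea, not the authors' open-source program: Monte Carlo shuffles of the token stream yield pointwise bounds on the type accumulation curve, against which an observed subcorpus curve can be compared. Shuffling at the token level is a simplifying assumption; the paper works with corpus samples.

```python
import random

def type_accumulation(tokens):
    """Number of distinct types seen after each successive token."""
    seen, curve = set(), []
    for t in tokens:
        seen.add(t)
        curve.append(len(seen))
    return curve

def mc_bounds(tokens, n_samples=1000, alpha=0.05, seed=0):
    """Pointwise lower/upper bounds of the accumulation curve
    over random orderings of the full corpus."""
    rng = random.Random(seed)
    curves = []
    for _ in range(n_samples):
        shuffled = tokens[:]
        rng.shuffle(shuffled)
        curves.append(type_accumulation(shuffled))
    k = int(alpha / 2 * n_samples)
    lower, upper = [], []
    for point in zip(*curves):          # one position at a time
        s = sorted(point)
        lower.append(s[k])
        upper.append(s[-k - 1])
    return lower, upper
```

An observed curve for -ity types in women's letters falling below `lower` would indicate significantly low type richness at the chosen significance level, which is the shape of the result the paper reports.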

28 citations


Cited by
Journal ArticleDOI
01 Jan 2003

1,739 citations

Journal ArticleDOI
TL;DR: The author guides the reader in about 350 pages from descriptive and basic statistical methods through classification and clustering to (generalised) linear and mixed models, enabling researchers and students alike to reproduce the analyses and learn by doing.
Abstract: The complete title of this book runs 'Analyzing Linguistic Data: A Practical Introduction to Statistics using R', and as such it reflects the purpose and spirit of the book very well. The author guides the reader in about 350 pages from descriptive and basic statistical methods through classification and clustering to (generalised) linear and mixed models. Each of the methods is introduced in the context of concrete linguistic problems and demonstrated on exciting datasets from current research in the language sciences. In line with its practical orientation, the book focuses primarily on using the methods and interpreting the results. This implies that the mathematical treatment of the techniques is kept to a minimum, if not absent from the book. In return, the reader is provided with very detailed explanations of how to conduct the analyses using R [1].

The first chapter sets the tone, being a 20-page introduction to R. For this and all subsequent chapters, the R code is intertwined with the chapter text, and the datasets and functions used are conveniently packaged in the languageR package, which is available on the Comprehensive R Archive Network (CRAN). With this approach, the author has done an excellent job of enabling researchers and students alike to reproduce the analyses and learn by doing. Another quality as a textbook is the fact that every chapter ends with Workbook sections where the user is invited to exercise his or her analysis skills on supplemental datasets. Full solutions including code, results, and comments are given in Appendix A (30 pages). Instructors are therefore very well served by this text, although they might want to balance the book with some more mathematical treatment depending on the target audience.

After the introductory chapter on R, the book opens with graphical data exploration. Chapter 3 treats probability distributions and common sampling distributions. Under basic statistical methods (Chapter 4), distribution tests and tests on means and variances are covered. Chapter 5 deals with clustering and classification. Strangely enough, the clustering section has material on PCA, factor analysis, and correspondence analysis, and includes only one subsection on clustering, devoted notably to hierarchical partitioning methods. The classification part deals with decision trees, discriminant analysis, and support vector machines. The regression chapter (Chapter 6) treats linear models, generalised linear models, piecewise linear models, and a substantial section on models for lexical richness. The final chapter on mixed models is particularly interesting, as it is one of the few textbook accounts that introduce the reader to using the (innovative) lme4 package of Douglas Bates, which implements linear mixed-effects models. Moreover, the case studies included in this

1,679 citations

Journal ArticleDOI
TL;DR: In a conversational format, this article answers a few questions that corpus linguists regularly face from linguists who have not yet used corpus-based methods, and discusses some of the central assumptions, notions, and methods of corpus linguistics.
Abstract: Corpus linguistics is one of the fastest-growing methodologies in contemporary linguistics. In a conversational format, this article answers a few questions that corpus linguists regularly face from linguists who have not used corpus-based methods so far. It discusses some of the central assumptions (‘formal distributional differences reflect functional differences’), notions (corpora, representativity and balancedness, markup and annotation), and methods of corpus linguistics (frequency lists, concordances, collocations), and discusses a few ways in which the discipline still needs to mature. At a recent LSA meeting … [with an obvious bow to Frederick Newmeyer] Question: So, I hear you’re a corpus linguist. Interesting, I get to see more and more
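
For readers new to the methods named above, here is a toy Python sketch of two of them, a frequency list and a keyword-in-context (KWIC) concordance; the function names and window size are illustrative assumptions.

```python
from collections import Counter

def frequency_list(tokens, n=10):
    """The n most frequent word types with their counts."""
    return Counter(tokens).most_common(n)

def concordance(tokens, keyword, window=4):
    """Yield a KWIC line for each occurrence of `keyword`."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            yield f"{left:>40} [{tok}] {right}"
```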

463 citations

01 Jan 2001
TL;DR: This book presents models of word frequency distributions, covering non-parametric and parametric models, mixture distributions, the randomness assumption, and examples of applications.
Abstract: 1. Word Frequencies. 2. Non-parametric models. 3. Parametric models. 4. Mixture distributions. 5. The Randomness Assumption. 6. Examples of Applications. A. List of Symbols. B. Solutions of the exercises. C. Software. D. Data sets. Bibliography. Index.

422 citations