
Showing papers in "Corpus Linguistics and Linguistic Theory in 2018"


Journal ArticleDOI
TL;DR: This study compares the use of log-likelihood (LL), a probability statistic, and odds ratio (OR), an effect size statistic, for keyword identification and argues that the two methods produce different keywords applicable to research focusing on different purposes.
Abstract: Keyword analysis is used in a range of sub-disciplines of applied linguistics from genre analyses to critically-oriented studies for different purposes ranging from producing a general characterization of a genre to identifying text-specific ideological issues. This study compares the use of log-likelihood (LL), a probability statistic, and odds ratio (OR), an effect size statistic, for keyword identification and argues that the two methods produce different keywords applicable to research focusing on different purposes. Through two case studies, keyword analyses of advance fee scams against the British National Corpus and research articles in applied linguistics against research articles from other academic disciplines, we show that both the LL and OR keywords concern the aboutness of the corpus, but differ in their specificity and pervasiveness through the corpus. LL highlights words which are relatively common in general use serving genre purposes, whereas OR highlights more specialized words serving critically-oriented purposes. Methodological and practical contributions to keyword analysis are discussed.
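
The contrast between the two statistics can be illustrated with a minimal Python sketch. It assumes the common two-corpus log-likelihood (G2) formulation and a plain odds ratio without smoothing; the abstract does not state the exact formulas the authors used, and the function names and all counts below are hypothetical.

```python
import math

def log_likelihood(a, b, c, d):
    """G2 for a word with a hits in a target corpus of c tokens
    and b hits in a reference corpus of d tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # treat 0 * log(0) as 0
            total += observed * math.log(observed / expected)
    return 2 * total

def odds_ratio(a, b, c, d):
    """Odds of the word in the target corpus divided by its odds in the
    reference corpus. Assumes 0 < a < c and 0 < b < d (no smoothing)."""
    return (a / (c - a)) / (b / (d - b))

# Hypothetical counts: (target hits, reference hits, target size, reference size).
common_word = (500, 3000, 100_000, 1_000_000)  # frequent, modest difference
rare_word = (20, 10, 100_000, 1_000_000)       # infrequent, large difference

for label, counts in (("common", common_word), ("rare", rare_word)):
    print(label, log_likelihood(*counts), odds_ratio(*counts))
```

With these made-up counts the frequent word receives the higher LL score while the rare word receives the higher OR, which is the kind of divergence the abstract describes.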

72 citations


Journal ArticleDOI
TL;DR: A corpus-based study of recent change in the English way-construction, drawing on data from the 1830s to the 2000s, finds that all three senses of the construction have gained in semantic diversity.
Abstract: This paper presents a corpus-based study of recent change in the English way-construction, drawing on data from the 1830s to the 2000s. Semantic change in the distribution of the construction is characterized by means of a distributional semantic model, which captures semantic similarity between verbs through their co-occurrence frequency with other words in the corpus. By plotting and comparing the semantic domain of the three senses of the construction at different points in time, it is found that they all have gained in semantic diversity. These findings are interpreted in terms of increases in schematicity, either of the verb slot or the motion component contributed by the construction.

44 citations


Journal ArticleDOI
TL;DR: In this article, a comparison of two statistical techniques that have been used to describe register variation: factor analysis (as used in Multi-Dimensional analysis, MDA) and canonical discriminant analysis (CDA) is presented.
Abstract: Previous theoretical and empirical research on register variation has argued that linguistic co-occurrence patterns have a highly systematic relationship to register differences, because they both share the same functional underpinnings. The goal of this study is to test this claim through a comparison of two statistical techniques that have been used to describe register variation: factor analysis (as used in Multi-Dimensional analysis, MDA) and canonical discriminant analysis (CDA). MDA and CDA have different statistical bases and thus give priority to different analytical considerations: linguistic co-occurrence in the case of MDA and the prediction of register differences in the case of CDA. Thus, there is no statistical reason to expect that the two techniques, if applied to the same corpus, will produce similar results. We hypothesize that although MDA and CDA approach register variation from opposite sides, they will produce similar results because both types of statistical patterns are motivated by underlying discourse functions. The present paper tests this claim through a case-study analysis of variation among web registers, applying MDA and CDA to analyze register variation in the same corpus of texts.

39 citations


Journal ArticleDOI
TL;DR: It is demonstrated that diachronic changes of the parameters of the Zipf–Mandelbrot law can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora).
Abstract: Using the Google Ngram Corpora for six different languages (including two varieties of English), a large-scale time series analysis is conducted. It is demonstrated that diachronic changes of the parameters of the Zipf–Mandelbrot law (and the parameter of the Zipf law, all estimated by maximum likelihood) can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora). The analysis also reveals that there are important cross-linguistic differences. It is argued that the Zipf–Mandelbrot parameters can be used as a first indicator of diachronic linguistic change, but more thorough analyses should make use of the full spectrum of different lexical, syntactical and stylometric measures to fully understand the factors that actually drive those changes.
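
As a rough illustration of the parameters involved, the Zipf–Mandelbrot law assigns the word of rank r a probability proportional to 1 / (r + beta)^alpha, with the Zipf law as the special case beta = 0. The Python sketch below fits alpha and beta by maximizing the log-likelihood over a coarse grid. The grid search is a simple stand-in, not the estimation procedure of the paper, and the function names, grid values and synthetic frequencies are all assumptions made for the example.

```python
import math

def zipf_mandelbrot_pmf(n_ranks, alpha, beta):
    """Probabilities proportional to 1 / (r + beta) ** alpha for r = 1..n_ranks."""
    weights = [(r + beta) ** -alpha for r in range(1, n_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def log_likelihood(freqs, alpha, beta):
    """Log-likelihood of rank-ordered frequencies (most frequent first)."""
    probs = zipf_mandelbrot_pmf(len(freqs), alpha, beta)
    return sum(f * math.log(p) for f, p in zip(freqs, probs))

def grid_search_mle(freqs, alphas, betas):
    """Return the (alpha, beta) pair on the grid with the highest log-likelihood."""
    candidates = ((a, b) for a in alphas for b in betas)
    return max(candidates, key=lambda ab: log_likelihood(freqs, ab[0], ab[1]))

# Synthetic rank-frequency data generated from known parameters.
synthetic = [1000 * p for p in zipf_mandelbrot_pmf(50, 1.2, 2.0)]
print(grid_search_mle(synthetic, [0.8, 1.0, 1.2, 1.4], [0.0, 1.0, 2.0, 3.0]))
```

Repeating such a fit on the rank-frequency list of each time slice would yield one (alpha, beta) pair per period, which could then be plotted as a time series.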

28 citations


Journal ArticleDOI
TL;DR: The objective of this article is to investigate the use of three of the most frequent pragmatic markers in English conversation in the London-Lund Corpus, i.e. “well”, “you know” and “I mean”.
Abstract: The objective of this article is to investigate the use of three of the most frequent pragmatic markers in English conversation in the London-Lund Corpus, i.e. “well”, “you know” and “I mean”. Specifically, the aim is to study the characteristics of the prosodic patterns and the Tone Unit position in the realization of pragmatic functions by the markers. The article combines the thorough analysis of the corpus data with the description of the function of these elements in the realization of Adaptive Context within the Dynamic Model of Meaning approach to pragmatics and communication.

19 citations


Journal ArticleDOI
TL;DR: It is demonstrated that accurate segmentation is in part dependent on the propositional content of text fragments, and that completely separating segmentation and annotation does not always yield text segments that correspond to the text units between which a conceptual relationship holds.
Abstract: Discourse segmentation is an important step in the process of annotating coherence relations. Ideally, implementing segmentation rules results in text segments that correspond to the units of thought related to each other. This paper demonstrates that accurate segmentation is in part dependent on the propositional content of text fragments, and that completely separating segmentation and annotation does not always yield text segments that correspond to the text units between which a conceptual relationship holds. In addition, it argues that elements belonging to the propositional content of the discourse should necessarily be included in the segmentation, but that inclusion of other text elements, for instance stance markers, should be optional.

14 citations


Journal ArticleDOI
TL;DR: This article examined verb-argument constructions (VACs) from learner and native-speaker corpora as well as psycholinguistic data to gain insights into second language speaker knowledge of VACs.
Abstract: This paper draws on data from learner and native-speaker corpora as well as psycholinguistic data to gain insights into second language speaker knowledge of English verb-argument constructions (VACs). For each of 34 VACs, L1 German and L1 Spanish advanced English learners’ and English native speakers’ dominant verb–VAC associations are examined based on data retrieved from the International Corpus of Learner English (ICLE), the Louvain International Database of Spoken English Interlanguage (LINDSEI), their respective Native Speaker (NS) reference corpora, and data collected in verbal fluency tasks in which participants complete VAC frames, such as, ‘she _______ with the…’ with verbs that come to mind. We compare findings from the different data sets and consider the strengths and limitations of each in relation to questions in usage-based second language acquisition and Construction Grammar.

13 citations


Journal ArticleDOI
TL;DR: It is argued that a good understanding of the construction at issue cannot circumvent the enormous variation in the expression of the genitive marker, and regular patterns can be discerned within the wide variation space, which is uncovered by using mixed-effects logistic regression.
Abstract: This article takes a usage-based perspective on the partitive genitive construction in Dutch (iets moois, ‘something beautiful’), which has previously drawn scholarly attention from a theoretical perspective, due to the challenges it presents to Dutch nominal morphosyntax. We will argue that a good understanding of the construction at issue cannot circumvent the enormous variation in the expression of the genitive marker. Within the wide variation space, regular patterns can be discerned, which we uncovered by using mixed-effects logistic regression. This approach allows us to assess the precise contribution of internal factors (e.g. length of the adjective, or the type of quantifier) and external factors (e.g. regional variety, or register), as well as their interactions. This article thus has three objectives: first, it aims to contribute to the description of Dutch syntax; second, it aspires to advance methodological standards in grammatical investigation; and third, it makes a theoretical plea for a usage-based perspective, with full recognition of variation.

12 citations


Journal ArticleDOI
TL;DR: It is shown that different instantiations of frequency help interpret the way variation is perceived and maintained by native speakers, and certain types of absolute frequency seem to have a dominant role in production tasks.
Abstract: If we can operationalize corpus frequency in multiple ways, using absolute values and proportional values, which of them is more closely connected with the behaviour of language users? In this contribution, we examine overabundant cells in morphological paradigms, and look at the contribution that frequency of occurrence can make to understanding the choices speakers make due to this richness. We look at ways of operationalizing the term frequency in data from corpora and native speakers: the proportional frequency of forms (i.e. percentage of time that a variant is found in corpus data considered as a proportion of all variants) and several interpretations of absolute frequency (i.e. the raw frequency of variants in data from the same corpus). Working with data from unmotivated morphological variation in Czech case forms, we show that different instantiations of frequency help interpret the way variation is perceived and maintained by native speakers. Proportional frequency seems most salient for speakers in forming their judgements, while certain types of absolute frequency seem to have a dominant role in production tasks.

9 citations


Journal ArticleDOI
TL;DR: The authors investigated register variation by Spanish users of English by comparing formal and informal speech from the Nijmegen Corpus of Spanish English that they created, which comprises speech from thirty-four Spanish speakers of English in interaction with Dutch confederates.
Abstract: English serves as a lingua franca in situations with varying degrees of formality. How formality affects non-native speech has rarely been studied. We investigated register variation by Spanish users of English by comparing formal and informal speech from the Nijmegen Corpus of Spanish English that we created. This corpus comprises speech from thirty-four Spanish speakers of English in interaction with Dutch confederates in two speech situations. Formality affected the amount of laughter and overlapping speech and the number of Spanish words. Moreover, formal speech had a more informational character than informal speech. We discuss how our findings relate to register variation in Spanish.

9 citations


Journal ArticleDOI
TL;DR: It is found that in both types of relative clauses, the more marked variant (which) is preferred in complex contexts, while the unmarked variant (that) is favored in contexts where the relative clause is short and more fully integrated with the NP it modifies.
Abstract: We investigate internal and stylistic factors affecting binary and ternary relativizer choice in subject (that vs which) and non-subject (that vs which vs zero) relative clauses. We employ a novel methodological approach to predicting relativizers: Bayesian regression modeling with the dimensional reduction of model inputs via factor analysis. Our factor analysis is motivated by the high degree of redundancy and collinearity in natural language data, while Bayesian regression models are robust to effects of data sparseness and (near) separation. We find that in both types of relative clauses, the more marked variant (which) is preferred in complex contexts, while the unmarked variant (that, or zero in non-subject relative clauses) is favored in contexts where the relative clause is short and more fully integrated with the NP it modifies. We also find that use of which is somewhat more sensitive to stylistic considerations in subject than in non-subject relative clauses, and that which correlates most strongly with features associated with lexical density, e.g. ‘nouniness’, rather than those often associated with formality, e.g. passivization and sentence length.

Journal ArticleDOI
TL;DR: In describing the variation between nominal and verbal gerunds in Early and Late Modern English, the bidimensional model outperformed the unidimensional one, showing that the aspectual-semantic distinctions between Modern English nominal and verbal gerunds are a matter of both aspect and temporal boundedness.
Abstract: This study presents a corpus-based comparison of two aspectual-semantic classification models proposed in theoretical literature (unidimensional vs. bidimensional) by applying them to a set of nominal and verbal gerunds from the Modern English period. It summarises (i) the differences between unidimensional and bidimensional classification models and (ii) the potential problems associated with them. Despite the difficulties of studying semantic aspect in Present-day as well as historical data, this study will argue that, (iii) at least for deverbal nominalization patterns, it is possible to take a bidimensional approach and maintain a clear distinction between, on the one hand, aspect features of the nominalized situation (stativity/dynamicity, durativity/punctuality, and telicity/atelicity), and, on the other hand, temporal boundedness of that situation. The question of which semantic classification model to use, then, is not so much one of which one is practically feasible in a corpus analysis, but rather which one is best suited to describe the attested variation. In order to determine the best model (in terms of parsimony and descriptive accuracy), (iv) the models were compared by means of Akaike weights. To describe the variation between nominal and verbal gerunds in Early and Late Modern English, the bidimensional model outperformed the unidimensional one, showing that (v) the aspectual-semantic distinctions between Modern English nominal and verbal gerunds are a matter of both aspect and temporal boundedness.
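
Akaike weights turn the AIC values of competing models into relative weights that sum to 1: each model's weight is exp(-Δ/2) divided by the sum of that quantity over all models, where Δ is the model's AIC minus the lowest AIC in the set. A minimal Python sketch follows. The AIC values are invented for illustration and are not taken from the study, and the function name is an assumption.

```python
import math

def akaike_weights(aic_values):
    """Convert a list of AIC scores into Akaike weights (same order, sum to 1)."""
    best = min(aic_values)
    # Relative likelihood of each model given its AIC difference to the best model.
    relative = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative)
    return [r / total for r in relative]

# Hypothetical AIC scores for two competing classification models.
aic_by_model = {"unidimensional": 1012.4, "bidimensional": 1003.9}
weights = akaike_weights(list(aic_by_model.values()))
for name, weight in zip(aic_by_model, weights):
    print(name, round(weight, 3))
```

Because AIC penalizes the number of parameters, a higher weight reflects a better trade-off between fit and parsimony rather than fit alone.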

Journal ArticleDOI
TL;DR: Results show that, despite their strong similarity, in some contexts ΔP is more predictive of hesitation placement than transitional probability, and neither ΔP nor any of the other association measures emerges as the universally best predictor.
Abstract: This paper explores the proposed benefits of ΔP (delta P) as a measure of collocation strength. Its focus is on contrasting ΔP with other, more commonly used, association measures, particularly transitional probabilities, but also mutual information and Lexical Gravity G. To this end, first the strong correlation between ΔP and transitional probability is illustrated with the help of two exemplary corpora. This is followed by an analysis of hesitation placement in spontaneous spoken English, based on the assumption that hesitations will not be placed within strong collocations. Results show that, despite their strong similarity, in some contexts ΔP is more predictive of hesitation placement than transitional probability. Yet neither ΔP nor any of the other association measures emerges as the universally best predictor. On the basis of these results, it is suggested that studies should always rely on several association measures.
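
The close relationship between the two measures is visible in their usual definitions. For a bigram w1 w2, the forward transitional probability is P(w2 | w1), while ΔP subtracts from it the probability of w2 after any other word: ΔP = P(w2 | w1) - P(w2 | not w1). The Python sketch below computes both from a 2x2 contingency table. The cell labels a, b, c, d, the function names and the counts are assumptions made for the example, and the abstract does not specify which direction of ΔP the paper uses.

```python
def transitional_probability(a, b):
    """Forward transitional probability P(w2 | w1).

    a: count of w1 followed by w2
    b: count of w1 followed by any other word
    """
    return a / (a + b)

def delta_p(a, b, c, d):
    """Forward delta P: P(w2 | w1) - P(w2 | not w1).

    c: count of w2 preceded by a word other than w1
    d: count of bigrams containing neither w1 first nor w2 second
    """
    return a / (a + b) - c / (c + d)

# Hypothetical bigram counts from a small corpus.
a, b, c, d = 80, 20, 100, 9800
print(transitional_probability(a, b), delta_p(a, b, c, d))
```

When w2 is rare outside the bigram, the subtracted term is small and the two measures nearly coincide. They diverge when w2 is frequent after many different words.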

Journal ArticleDOI
TL;DR: This article tests the widely held assumption that the frequency adverb always combined with the progressive aspect is typically used in negative evaluations expressing irritation, i.e., complaints, and finds that neutral functions predominate across six genres. It relates this finding to the propensity of always progressives to act similarly to the simple aspect, and the predominance of negative over positive functions to a cognitive phenomenon called the negativity bias.
Abstract: It is widely assumed that the frequency adverb always combined with the progressive aspect is typically used in negative evaluations expressing irritation, i.e., complaints. Adopting a cognitive-functional approach, I test this claim across six genres of Present Day English. Always progressives were coded according to their functions: Describe (neutral), Complain (negative), Lament (negative), or Praise (positive). Neutral, rather than negative, functions predominated in all genres, although negative functions outnumbered positive functions. I relate the former finding to the propensity of always progressives to act similarly to the simple aspect and the latter to a cognitive phenomenon called the negativity bias.

Journal ArticleDOI
TL;DR: A first computer-aided semantic analysis of Chinese climate change news discourse is offered, testing the Chinese version of the USAS semantic analysis system and identifying areas for improvement, including the development and compilation of high-quality and updated English-Chinese bilingual terminologies for cross-lingual and cross-cultural environmental studies.
Abstract: This study offers a first computer-aided semantic analysis of Chinese climate change news discourse. It explored the validity and productivity of the automatic Lancaster Semantic Analysis System (USAS). While USAS has been well tested in a number of studies for the English language and the system has various language versions, its Chinese version has not been explored sufficiently in applied linguistics and cross-lingual and cross-cultural studies. The Chinese version of USAS (CH_USAS) was instrumental in the statistical data modelling and construction of a data-driven analytical model for Chinese climate change news discourse. The model testing produced a mixed result which revealed both the efficiency of this useful automatic cross-lingual semantic analysis system and areas for its improvement, including the development and compilation of high-quality and updated English-Chinese bilingual terminologies for cross-lingual and cross-cultural environmental studies.

Journal ArticleDOI
TL;DR: This work addresses the question of whether verb argument structure and word associations, which are related to the speakers’ organization of the mental lexicon, shape similarity between verbs in a congruent manner, a topic which has not been explored previously.
Abstract: Similarity, which plays a key role in fields like cognitive science, psycholinguistics and natural language processing, is a broad and multifaceted concept. In this work we analyse how two approaches that belong to different perspectives, the corpus view and the psycholinguistic view, articulate similarity between verb senses in Spanish. Specifically, we compare the similarity between verb senses based on their argument structure, which is captured through semantic roles, with their similarity defined by word associations. We address the question of whether verb argument structure, which reflects the expression of the events, and word associations, which are related to the speakers’ organization of the mental lexicon, shape similarity between verbs in a congruent manner, a topic which has not been explored previously. While we find significant correlations between verb sense similarities obtained from these two approaches, our findings also highlight some discrepancies between them and the importance of the degree of abstraction of the corpus annotation and psycholinguistic representations.