Child psychiatrists, pediatricians, and other child clinicians need to have a solid understanding of child language development. There are at least four important reasons that make this necessary. First, slowing, arrest, and deviation of language development are highly associated with, and complicate the course of, child psychopathology. Second, language competence plays a crucial role in emotional and mood regulation, evaluation, and therapy. Third, language deficits are the most frequent underpinning of the learning disorders, ubiquitous in our clinical populations. Fourth, clinicians should not confuse the rich linguistic and dialectal diversity of our clinical populations with abnormalities in child language development. The challenge for the clinician becomes, then, how to get immersed in the captivating field of child language acquisition without getting overwhelmed by its conceptual and empirical complexity. In the past 50 years and since the seminal works of Roger Brown, Jerome Bruner, and Catherine Snow, child language researchers (often known as developmental psycholinguists) have produced a remarkable body of knowledge. Linguists such as Chomsky and philosophers such as Grice have strongly influenced the science of child language. One of the major tenets of Chomskian linguistics (known as generative grammar) is that children’s capacity to acquire language is “hardwired” with “universal grammar”—an innate language acquisition device (LAD), a language “instinct”—at its core. 
This view is in part supported by the assertion that the linguistic input that children receive is relatively dismal and of poor quality relative to the high quantity and quality of output that they manage to produce after age 2 and that only an advanced, innate capacity to decode and organize linguistic input can enable them to “get from here (prelinguistic infant) to there (linguistic child).” In “Constructing a Language,” Tomasello presents a contrasting theory of how the child acquires language: It is not a universal grammar that allows for language development. Rather, human cognition universals of communicative needs and vocal-auditory processing result in some language universals, such as nouns and verbs as expressions of reference and predication (p. 19). The author proposes that two sets of cognitive skills resulting from biological/phylogenetic adaptations are fundamental to the ontogenetic origins of language. These sets of inherited cognitive skills are intention-reading on the one hand and pattern-finding on the other. Intention-reading skills encompass the prelinguistic infant’s capacities to share attention to outside events with other persons, establishing joint attentional frames, to understand other people’s communicative intentions, and to imitate the adult’s communicative intentions (an intersubjective form of imitation that requires symbolic understanding and perspective-taking). Pattern-finding skills include the ability of infants as young as 7 months old to analyze concepts and percepts (most relevant here, auditory or speech percepts) and create concrete or abstract categories that contain analogous items. Tomasello, a most prominent developmental scientist with research foci on child language acquisition and on social cognition and social learning in children and primates, succinctly and clearly introduces the major points of his theory and his views on the origins of language in the initial chapters.
In subsequent chapters, he delves into the details by covering most language acquisition domains, namely, word (lexical) learning, syntax and morphology, and conversation, narrative, and extended discourse. Although one of the remaining domains (pragmatics) is at the core of his theory and permeates the text throughout, the relative paucity of passages explicitly devoted to discussing acquisition and pro…

Constructing a language: A usage-based theory of language acquisition

We describe a state-of-the-art sentiment analysis system that detects (a) the sentiment of short informal textual messages such as tweets and SMS (message-level task) and (b) the sentiment of a word or a phrase within a message (term-level task). The system is based on a supervised statistical text classification approach leveraging a variety of surface-form, semantic, and sentiment features. The sentiment features are primarily derived from novel high-coverage tweet-specific sentiment lexicons. These lexicons are automatically generated from tweets with sentiment-word hashtags and from tweets with emoticons. To adequately capture the sentiment of words in negated contexts, a separate sentiment lexicon is generated for negated words.

The system ranked first in the SemEval-2013 shared task 'Sentiment Analysis in Twitter' (Task 2), obtaining an F-score of 69.02 in the message-level task and 88.93 in the term-level task. Post-competition improvements boost the performance to an F-score of 70.45 (message-level task) and 89.50 (term-level task). The system also obtains state-of-the-art performance on two additional datasets: the SemEval-2013 SMS test set and a corpus of movie review excerpts. The ablation experiments demonstrate that the use of the automatically generated lexicons results in performance gains of up to 6.5 absolute percentage points.
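The lexicon generation described in this abstract can be illustrated with a small sketch. The abstract does not give the scoring formula, so the score used here (a smoothed log-ratio of a word's frequency in positive versus negative tweets, equivalent to a difference of two pointwise mutual information values) is an assumption, as are the negator list, the `_NEG` suffix, and the function names `mark_negation` and `build_lexicon`. In practice the labels would come from sentiment-word hashtags or emoticons; here they are passed in directly.

```python
import math
from collections import Counter

# Illustrative negator list and clause-ending punctuation (assumptions).
NEGATORS = {"not", "no", "never", "n't", "cannot"}
CLAUSE_END = {".", ",", "!", "?", ";", ":"}


def mark_negation(tokens):
    """Append _NEG to every token between a negator and the next punctuation mark."""
    marked, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False
            marked.append(tok)
        elif tok.lower() in NEGATORS:
            negated = True
            marked.append(tok)
        else:
            marked.append(tok + "_NEG" if negated else tok)
    return marked


def build_lexicon(labelled_tweets, smoothing=1.0):
    """Score each word by a smoothed log-ratio of positive vs. negative usage.

    labelled_tweets: iterable of (tokens, label), label in {"positive", "negative"}.
    Words in negated contexts are counted separately (as word_NEG entries).
    """
    counts = {"positive": Counter(), "negative": Counter()}
    for tokens, label in labelled_tweets:
        counts[label].update(mark_negation(tokens))
    vocab = set(counts["positive"]) | set(counts["negative"])
    total_pos = sum(counts["positive"].values()) + smoothing * len(vocab)
    total_neg = sum(counts["negative"].values()) + smoothing * len(vocab)
    lexicon = {}
    for word in vocab:
        pos = counts["positive"][word] + smoothing
        neg = counts["negative"][word] + smoothing
        # PMI(word, positive) - PMI(word, negative) reduces to this ratio.
        lexicon[word] = math.log2((pos * total_neg) / (neg * total_pos))
    return lexicon


tweets = [
    (["good", "day"], "positive"),
    (["not", "good", "day"], "negative"),
    (["great", "day"], "positive"),
    (["bad", "day"], "negative"),
]
lexicon = build_lexicon(tweets)
```

Positive scores indicate association with positive tweets, negative scores with negative tweets; `good` and `good_NEG` receive separate entries, which is the point of keeping a distinct lexicon for negated contexts.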

/pdf/sentiment-analysis-of-short-informal-texts-4lbmfxinxh.pdf

Sentiment analysis of short informal texts

In Semantic Textual Similarity, systems rate the degree of semantic equivalence between two text snippets. This year, the participants were challenged with new data sets for English, as well as the introduction of Spanish, as a new language in which to assess semantic similarity. For the English subtask, we exposed the systems to a diversity of testing scenarios, by preparing additional OntoNotes-WordNet sense mappings and news headlines, as well as introducing new genres, including image descriptions, DEFT discussion forums, DEFT newswire, and tweet-newswire headline mappings. For Spanish, since, to our knowledge, this is the first time that official evaluations are conducted, we used well-formed text, by featuring sentences extracted from encyclopedic content and newswire. The annotations for both tasks leveraged crowdsourcing. The Spanish subtask engaged 9 teams participating with 22 system runs, and the English subtask attracted 15 teams with 38 system runs.

/pdf/semeval-2014-task-10-multilingual-semantic-textual-1tp3uw8dic.pdf

SemEval-2014 Task 10: Multilingual Semantic Textual Similarity

This paper presents the task on the evaluation of Compositional Distributional Semantics Models on full sentences organized for the first time within SemEval-2014. Participation was open to systems based on any approach. Systems were presented with pairs of sentences and were evaluated on their ability to predict human judgments on (i) semantic relatedness and (ii) entailment. The task attracted 21 teams, most of which participated in both subtasks. We received 17 submissions in the relatedness subtask (for a total of 66 runs) and 18 in the entailment subtask (65 runs).

/pdf/semeval-2014-task-1-evaluation-of-compositional-u0ikl9nnuw.pdf

SemEval-2014 Task 1: Evaluation of Compositional Distributional Semantic Models on Full Sentences through Semantic Relatedness and Textual Entailment

Considerable progress has been made in recent years in the development of dialogue systems that support robust and efficient human–machine interaction using spoken language. Spoken dialogue technology allows various interactive applications to be built and used for practical purposes, and research focuses on issues that aim to increase the system’s communicative competence by including aspects of error correction, cooperation, multimodality, and adaptation in context. This book gives a comprehensive view of state-of-the-art techniques that are used to build spoken dialogue systems. It provides an overview of the basic issues such as system architectures, various dialogue management methods, system evaluation, and also surveys advanced topics concerning extensions of the basic model to more conversational setups. The goal of the book is to provide an introduction to the methods, problems, and solutions that are used in dialogue system development and evaluation. It presents dialogue modelling and system development issues relevant in both academic and industrial environments and also discusses requirements and challenges for advanced interaction management and future research.
Keywords: spoken dialogue systems, multimodality, evaluation, error-handling, dialogue management, statistical methods

https://www.morganclaypool.com/doi/suppl/10.2200/S00509ED1V01Y201305HLT023/suppl_file/Dagan_Ch1.pdf

Synthesis Lectures on Human Language Technologies

This article builds on a mathematical explanation of one of the most prominent stylometric measures, Burrows’s Delta (and its variants), to understand and explain its working. Starting with the conceptual separation between feature selection, feature scaling, and distance measures, we have designed a series of controlled experiments in which we used the kind of feature scaling (various types of standardization and normalization) and the type of distance measures (notably Manhattan, Euclidean, and Cosine) as independent variables and the correct authorship attributions as the dependent variable indicative of the performance of each of the methods proposed. In this way, we are able to describe in some detail how these two variables interact with each other and how they influence the results. Thus we can show that feature vector normalization, that is, the transformation of the feature vectors to a uniform length of 1 (implicit in the cosine measure), is the decisive factor for the improvement of Delta proposed recently. We are also able to show that the information particularly relevant to the identification of the author of a text lies in the profile of deviation across the most frequent words rather than in the extent of the deviation or in the deviation of specific words only.
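The separation between feature scaling and distance measure described in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the population standard deviation is an arbitrary choice, and the relative frequencies are made-up toy values. Classic Burrows's Delta corresponds to the Manhattan distance between z-score vectors (divided by the number of words); the cosine variant replaces the distance measure, which implicitly normalizes each vector to length 1.

```python
import math
from statistics import mean, pstdev


def zscores(freqs):
    """Standardize each word's relative frequency across all texts.

    freqs: one frequency vector per text, all using the same word order.
    """
    columns = list(zip(*freqs))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, mus, sigmas)]
        for row in freqs
    ]


def normalize(u):
    """Scale a vector to Euclidean length 1 (returned unchanged if all zeros)."""
    length = math.sqrt(sum(a * a for a in u))
    return [a / length for a in u] if length > 0 else list(u)


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy relative frequencies of three frequent words in three texts (made up).
z = zscores([[0.05, 0.03, 0.01], [0.04, 0.03, 0.02], [0.06, 0.02, 0.01]])
burrows_delta = manhattan(z[0], z[1]) / len(z[0])
cosine_delta = cosine_distance(z[0], z[1])
```

Because the cosine distance ignores vector length, it depends only on the profile of deviations across words and not on their overall magnitude, which matches the abstract's conclusion about where the authorship signal lies.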

/pdf/understanding-and-explaining-delta-measures-for-authorship-31t4c5qlc6.pdf

Understanding and explaining Delta measures for authorship attribution

In this paper we describe SoMaJo, a rule-based tokenizer for German web and social media texts that was the best-performing system in the EmpiriST 2015 shared task with an average F1-score of 99.57. We give an overview of the system and the phenomena its rules cover, as well as a detailed error analysis. The tokenizer is available as free software.

/pdf/somajo-state-of-the-art-tokenization-for-german-web-and-2rxidrq60x.pdf

SoMaJo: State-of-the-art tokenization for German web and social media texts

This paper describes our approach to the SemEval-2013 task on “Sentiment Analysis in Twitter”. We use simple bag-of-words models, a freely available sentiment dictionary automatically extended with distributionally similar terms, as well as lists of emoticons and internet slang abbreviations in conjunction with fast and robust machine learning algorithms. The resulting system is resource-lean, making it relatively independent of a specific language. Despite its simplicity, the system achieves competitive accuracies of 0.70–0.72 in detecting the sentiment of text messages. We also apply our approach to the task of detecting the context-dependent sentiment of individual words and phrases within a message.

/pdf/klue-simple-and-robust-methods-for-polarity-classification-13f9wyq3iy.pdf

KLUE: Simple and robust methods for polarity classification

Being able to quantify the semantic similarity between two texts is important for many practical applications. SemantiKLUE combines unsupervised and supervised techniques into a robust system for measuring semantic similarity. At the core of the system is a word-to-word alignment of two texts using a maximum weight matching algorithm. The system participated in three SemEval-2014 shared tasks and the competitive results are evidence for its usability in that broad field of application.
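The idea of aligning two texts word by word through maximum weight matching can be sketched as follows. This is an illustration under stated assumptions, not the SemantiKLUE implementation: the word similarity used here (Dice overlap of character bigrams) is a stand-in for whatever similarity the real system uses, the names `similarity` and `align` are hypothetical, and the exhaustive search is only practical for very short texts (a real system would use a polynomial-time assignment algorithm).

```python
from itertools import permutations


def char_bigrams(word):
    """Character bigrams of a lowercased word padded with boundary markers."""
    padded = f"#{word.lower()}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def similarity(a, b):
    """Dice coefficient over character bigrams (placeholder word similarity)."""
    x, y = char_bigrams(a), char_bigrams(b)
    return 2 * len(x & y) / (len(x) + len(y))


def align(tokens_a, tokens_b):
    """Return (total weight, aligned word pairs) of a maximum weight matching.

    Every word of the shorter text is matched to a distinct word of the longer
    one; all assignments are enumerated, so keep the inputs short.
    """
    swapped = len(tokens_a) > len(tokens_b)
    short, long_ = (tokens_b, tokens_a) if swapped else (tokens_a, tokens_b)
    best_score, best_pairs = -1.0, []
    for perm in permutations(range(len(long_)), len(short)):
        score = sum(similarity(short[i], long_[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    if swapped:
        best_pairs = [(j, i) for i, j in best_pairs]
    return best_score, [(tokens_a[i], tokens_b[j]) for i, j in best_pairs]


score, pairs = align(["a", "cat", "sleeps"], ["the", "cat", "slept"])
```

The total weight of the matching, possibly normalized by text length, can then serve as one similarity feature among others in a supervised model.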

/pdf/semantiklue-robust-semantic-similarity-at-multiple-levels-2cu0m2xf7o.pdf

SemantiKLUE: Robust Semantic Similarity at Multiple Levels Using Maximum Weight Matching

Statistical association measures (AM) play an important role in the automatic extraction of collocations and multiword expressions from corpora, but many parameters governing their performance are still poorly understood. Systematic evaluation studies have produced conflicting recommendations for an optimal AM, and little attention has been paid to other parameters such as the underlying corpus, the size of the co-occurrence context, or the application of a frequency threshold.
Our paper presents the results of a large-scale evaluation study covering 13 corpora, eight context sizes, four frequency thresholds, and 20 AMs against two different gold standards of lexical collocations. While the optimal choice of an AM depends strongly on the particular gold standard used, other parameters prove much more robust: (i) small co-occurrence contexts are better than larger spans, and the best results are usually obtained from syntactic dependencies; (ii) corpus quality is more important than sheer size, but large Web corpora prove to be a valid substitute for the British National Corpus; (iii) frequency thresholds seem to be unnecessary in most situations, as the statistical AMs successfully weed out rare and unreliable candidates; (iv) there is little interaction between the choice of AM and the other parameters.
In order to provide complete evidence for our observations to readers, we created an interactive Web-based application that allows users to manipulate all evaluation parameters and dynamically updates evaluation graphs and summaries.
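To make the notion of an association measure concrete, the sketch below computes four widely used ones from the co-occurrence frequency of a word pair and the marginal frequencies of its two words. The abstract does not list which 20 measures were evaluated, so this selection (pointwise mutual information, Dice coefficient, t-score, log-likelihood) and the function name `association_scores` are assumptions for illustration only.

```python
import math


def association_scores(o11, f1, f2, n):
    """Association scores for a word pair from a 2x2 contingency table.

    o11: co-occurrence frequency of the pair (must be > 0)
    f1, f2: marginal frequencies of the first and second word
    n: sample size (total number of co-occurrence tokens)
    """
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    e11 = expected[0]
    # Cells with an observed count of 0 contribute nothing to the sum.
    g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return {
        "MI": math.log2(o11 / e11),
        "Dice": 2 * o11 / (f1 + f2),
        "t-score": (o11 - e11) / math.sqrt(o11),
        "log-likelihood": g2,
    }


scores = association_scores(o11=10, f1=100, f2=100, n=10000)
independent = association_scores(o11=1, f1=100, f2=100, n=10000)
```

Candidate pairs are ranked by one of these scores and the ranking is compared against a gold standard; a frequency threshold would simply discard pairs whose `o11` falls below a cut-off before ranking.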

Thomas Proisl

Papers

Understanding and explaining Delta measures for authorship attribution

SoMaJo: State-of-the-art tokenization for German web and social media texts

KLUE: Simple and robust methods for polarity classification

SemantiKLUE: Robust Semantic Similarity at Multiple Levels Using Maximum Weight Matching

E-VIEW-alation – a Large-scale Evaluation Study of Association Measures for Collocation Identification