Kalika Bali
Researcher at Microsoft
Publications - 86
Citations - 2284
Kalika Bali is an academic researcher at Microsoft. The author has contributed to research in the topics of computer science and Hindi. The author has an h-index of 20 and has co-authored 72 publications receiving 1,637 citations. Previous affiliations of Kalika Bali include Hewlett-Packard.
Papers
Posted Content
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
TL;DR: Examines the relation between types of languages, their resources, and their representation at NLP conferences to understand the trajectory different languages have followed over time, and underlines the disparity between languages.
Proceedings ArticleDOI
POS Tagging of English-Hindi Code-Mixed Social Media Content
TL;DR: Describes initial efforts to create a multi-level annotated corpus of Hindi-English code-mixed text collated from Facebook forums, and explores language identification, back-transliteration, normalization, and POS tagging of this data.
Proceedings ArticleDOI
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
TL;DR: In this article, the authors look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time.
Proceedings ArticleDOI
“I am borrowing ya mixing?” An Analysis of English-Hindi Code Mixing in Facebook
TL;DR: A classification of code-mixed words based on frequency and linguistic typology underlines the fact that, while there are easily identifiable cases of borrowing and mixing at the two ends, a large majority of the words form a continuum in the middle, emphasizing the need to handle them at different levels for automatic processing of the data.
Proceedings ArticleDOI
Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data
Adithya Pratapa,Gayatri Bhat,Monojit Choudhury,Sunayana Sitaram,Sandipan Dandapat,Kalika Bali +5 more
TL;DR: Presents a computational technique for creating grammatically valid artificial code-mixed (CM) data based on the Equivalence Constraint Theory, and shows that when training examples are sampled appropriately from this synthetic data and presented in a certain order, they can significantly reduce the perplexity of an RNN-based language model.