Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.
First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.
Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.
Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.
Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.
Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

Machine learning

We consider the problem of classifying documents not by topic, but by overall sentiment, e.g., determining whether a review is positive or negative. Using movie reviews as data, we find that standard machine learning techniques definitively outperform human-produced baselines. However, the three machine learning methods we employed (Naive Bayes, maximum entropy classification, and support vector machines) do not perform as well on sentiment classification as on traditional topic-based categorization. We conclude by examining factors that make the sentiment classification problem more challenging.

/pdf/thumbs-up-sentiment-classiflcation-using-machine-learning-1m4vdmh1b0.pdf

Thumbs up? Sentiment Classification using Machine Learning Techniques
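
As a rough illustration of one of the three methods named in the abstract above, the following is a minimal multinomial Naive Bayes sketch in Python. It is not the authors' implementation: the function names (`train_naive_bayes`, `classify`), the add-one smoothing, the whitespace tokenization, and the four toy reviews are all assumptions made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """Fit a multinomial Naive Bayes model from (tokens, label) pairs."""
    doc_counts = Counter()   # number of training documents per label
    word_counts = {}         # label -> Counter of word frequencies
    vocab = set()
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(tokens, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training reviews (assumption; not data from the paper).
TRAIN = [
    ("a wonderful moving film".split(), "pos"),
    ("brilliant acting and a wonderful script".split(), "pos"),
    ("a dull boring film".split(), "neg"),
    ("boring script and terrible acting".split(), "neg"),
]
MODEL = train_naive_bayes(TRAIN)

print(classify("wonderful brilliant film".split(), MODEL))
print(classify("terrible boring".split(), MODEL))
```

Working in log space avoids numeric underflow on long reviews, and the smoothing keeps a single unseen word-label pair from zeroing out a class.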

Microblogging has become a very popular communication tool among Internet users. Millions of users share opinions on different aspects of life every day, so microblogging websites are rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, few research works have been devoted to this topic. In our paper, we focus on using Twitter, the most popular microblogging platform, for the task of sentiment analysis. We show how to automatically collect a corpus for sentiment analysis and opinion mining purposes. We perform linguistic analysis of the collected corpus and explain discovered phenomena. Using the corpus, we build a sentiment classifier that is able to determine positive, negative and neutral sentiments for a document. Experimental evaluations show that our proposed techniques are efficient and perform better than previously proposed methods. In our research, we worked with English; however, the proposed technique can be used with any other language.

/pdf/twitter-as-a-corpus-for-sentiment-analysis-and-opinion-zxo77yicta.pdf

Twitter as a Corpus for Sentiment Analysis and Opinion Mining

Knowledge Management

Sentiment analysis algorithms and applications: A survey

Knowledge Discovery in Databases (KDD) focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Previous work in text mining focused on the word or the tag level. This paper presents an approach to performing text mining at the term level. The mining process starts by preprocessing the document collection and extracting terms from the documents. Each document is then represented by a set of terms and annotations characterizing the document. Terms and additional higher-level entities are then organized in a hierarchical taxonomy. In this paper we describe the Term Extraction module of the Document Explorer system, and provide an experimental evaluation performed on a set of 52,000 documents published by Reuters in the years 1995–1996.

/pdf/text-mining-at-the-term-level-1feus2211a.pdf

Text Mining at the Term Level

We present a bottom-up parsing algorithm for stochastic context-free grammars that is able (1) to deal with multiple interpretations of sentences containing compound words; (2) to extract N-most probable parses in O(n^3) and compute at the same time all possible parses of any portion of the input sequence with their probabilities; (3) to deal with out of vocabulary words. Explicitly extracting all the parse trees associated to a given input sentence depends on the complexity of the grammar, but even in the case where this number is exponential in n, the chart used by the algorithm for the representation is of O(n^2) space complexity.

A generalized CYK algorithm for parsing stochastic CFG
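
For orientation, here is a sketch of the textbook probabilistic CYK (Viterbi) recurrence for a grammar in Chomsky normal form. It is deliberately not the generalized algorithm of the paper: it handles neither compound words, N-best extraction, nor out-of-vocabulary words. The function name `viterbi_cyk`, the dictionary-based grammar encoding, and the toy grammar are assumptions made for this example. It does show where the cubic time and quadratic chart size mentioned in the abstract come from.

```python
from collections import defaultdict


def viterbi_cyk(words, lexical, binary, start="S"):
    """Probability of the most likely parse of `words` under a CNF PCFG.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    Returns 0.0 when no parse rooted at `start` exists.
    """
    n = len(words)
    # chart[i][j] maps a nonterminal to its best probability over words[i:j];
    # the chart has O(n^2) cells.
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for (a, w), p in lexical.items():
            if w == word:
                chart[i][i + 1][a] = max(chart[i][i + 1][a], p)
    # Span length, start position and split point form the O(n^3) loop nest.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (a, b, c), p in binary.items():
                    left = chart[i][k].get(b, 0.0)
                    right = chart[k][j].get(c, 0.0)
                    if left and right:
                        chart[i][j][a] = max(chart[i][j][a], p * left * right)
    return chart[0][n].get(start, 0.0)


# Toy grammar (assumption, for illustration only).
LEXICAL = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}

print(viterbi_cyk("she eats fish".split(), LEXICAL, BINARY))
print(viterbi_cyk("eats she".split(), LEXICAL, BINARY))
```

Replacing `max` with a sum over the same cells would give the total (inside) probability of the sentence instead of the probability of its best parse.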

In the general framework of knowledge discovery, Data Mining techniques are usually dedicated to information extraction from structured databases. Text Mining techniques, on the other hand, are dedicated to information extraction from unstructured textual data and Natural Language Processing (NLP) can then be seen as an interesting tool for the enhancement of information extraction procedures. In this paper, we present two examples of Text Mining tasks, association extraction and prototypical document extraction, along with several related NLP techniques.

/pdf/text-mining-natural-language-techniques-and-text-mining-46ltkcujcn.pdf

Text Mining: Natural Language techniques and Text Mining applications
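
The association extraction task mentioned in the abstract above can be pictured with a small support/confidence computation over keyword sets. This is a generic association-rule sketch, not the procedure from the paper: the function name `keyword_associations`, the restriction to single-keyword rules, the thresholds, and the toy keyword sets are all assumptions.

```python
from itertools import permutations


def keyword_associations(doc_keywords, min_support=0.3, min_confidence=0.6):
    """Return (x, y, support, confidence) for single-keyword rules x => y.

    support    = fraction of documents containing both x and y
    confidence = fraction of documents containing x that also contain y
    """
    docs = [set(k) for k in doc_keywords]
    n = len(docs)
    vocab = sorted(set().union(*docs)) if docs else []
    rules = []
    for x, y in permutations(vocab, 2):
        both = sum(1 for d in docs if x in d and y in d)
        with_x = sum(1 for d in docs if x in d)  # at least 1, since x is in vocab
        support = both / n
        confidence = both / with_x
        if support >= min_support and confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Toy keyword sets, one per document (assumption, for illustration only).
DOCS = [
    {"oil", "opec"},
    {"oil", "opec", "price"},
    {"oil", "price"},
    {"wheat"},
]

for x, y, support, confidence in keyword_associations(DOCS):
    print(f"{x} => {y}  support={support:.2f}  confidence={confidence:.2f}")
```

The brute-force pair enumeration is only practical for small vocabularies; it is meant to make the two measures concrete rather than to scale.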

Knowledge Discovery in Databases (KDD), also known as data mining, focuses on the computerized exploration of large amounts of data and on the discovery of interesting patterns within them. While most work on KDD has been concerned with structured databases, there has been little work on handling the huge amount of information that is available only in unstructured textual form. Given a collection of text documents, most approaches to text mining perform knowledge-discovery operations on labels associated with each document. At one extreme, these labels are keywords that represent the results of non-trivial keyword-labeling processes, and, at the other extreme, these labels are nothing more than a list of the words within the documents of interest. This paper presents an intermediate approach, one that we call text mining at the term level, in which knowledge discovery takes place on a more focused collection of words and phrases that are extracted from and label each document. These terms plus additional higher-level entities are then organized in a hierarchical taxonomy and are used in the knowledge discovery process. This paper describes Document Explorer, our tool that implements text mining at the term level. It consists of a document retrieval module, which converts retrieved documents from their native formats into documents represented using the SGML mark-up language used by Document Explorer; a two-stage term-extraction approach, in which terms are first proposed in a term-generation stage, and from which a smaller set are then selected in a term-filtering stage in light of their frequencies of occurrence elsewhere in the collection; our taxonomy-creation tool by which the user can help specify higher-level entities that inform the knowledge-discovery process; and our knowledge-discovery tools for the resulting term-labeled documents. Finally, we evaluate our approach on a collection of patent records as well as Reuters newswire stories. Our results confirm that Text Mining serves as a powerful technique to manage knowledge encapsulated in large document collections.

Knowledge Management: A Text Mining Approach
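
The two-stage term extraction described in the abstract above (term generation, then frequency-based term filtering) might be sketched as follows. This is not Document Explorer's actual procedure, which the abstract does not spell out: the n-gram candidate generator, the tiny stopword list, the document-frequency thresholds `min_df` and `max_df_ratio`, and the toy headlines are all assumptions made for this example.

```python
from collections import Counter

STOPWORDS = {"the", "of", "a", "in", "and", "to", "on", "is"}  # tiny illustrative list


def generate_terms(text, max_len=2):
    """Stage 1: propose candidate terms as stopword-free word n-grams."""
    words = text.lower().split()
    candidates = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                candidates.add(" ".join(gram))
    return candidates


def filter_terms(docs, min_df=2, max_df_ratio=0.8):
    """Stage 2: keep candidates by their document frequency in the collection."""
    per_doc = [generate_terms(d) for d in docs]
    df = Counter(term for terms in per_doc for term in terms)
    limit = max_df_ratio * len(docs)
    kept = {t for t, c in df.items() if min_df <= c <= limit}
    # Each document is then represented by the surviving terms it contains.
    return [sorted(terms & kept) for terms in per_doc]


# Toy headlines (assumption, for illustration only).
DOCS = [
    "crude oil prices rise",
    "crude oil exports fall",
    "gold prices rise",
    "wheat exports fall in june",
]

for terms in filter_terms(DOCS):
    print(terms)
```

Candidates that occur in only one document (such as "gold") are dropped as too rare to support pattern discovery, while the upper bound would remove terms so common that they label almost every document.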

In the general context of Knowledge Discovery, specific techniques, called Text Mining techniques, are necessary to extract information from unstructured textual data. The extracted information can then be used for the classification of the content of large textual bases. In this paper, we present two examples of information that can be automatically extracted from text collections: probabilistic associations of key-words and prototypical document instances. The Natural Language Processing (NLP) tools necessary for such extractions are also presented.

Martin Rajman

Papers

Text Mining at the Term Level

A generalized CYK algorithm for parsing stochastic CFG

Text Mining: Natural Language techniques and Text Mining applications

Knowledge Management: A Text Mining Approach

Text Mining, knowledge extraction from unstructured textual data