
Jipeng Qiang

Researcher at Yangzhou University

Publications: 75
Citations: 778

Jipeng Qiang is an academic researcher at Yangzhou University. He has contributed to research in topics including computer science and sentence processing. He has an h-index of 10 and has co-authored 58 publications receiving 395 citations. His previous affiliations include the University of Massachusetts Boston and Hefei University of Technology.

Papers
Journal ArticleDOI

Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

TL;DR: This survey conducts a comprehensive review of short text topic modeling techniques proposed in the literature. It presents three categories of methods, based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with examples of representative approaches in each category and an analysis of their performance on various tasks.
Book ChapterDOI

Topic Modeling over Short Texts by Incorporating Word Embeddings

TL;DR: This paper proposes the Embedding-based Topic Model (ETM), which learns latent topics from short texts by aggregating them into long pseudo-texts and using a Markov Random Field regularized model that gives correlated words a better chance of being assigned to the same topic.
Proceedings Article

Text simplification using Neural Machine Translation

TL;DR: Original English and simplified English are treated as two languages, and an NMT model, a Recurrent Neural Network (RNN) encoder-decoder, is applied to text simplification (TS) so that the neural network learns simplification rules by itself.
Journal ArticleDOI

Multi-document summarization using closed patterns

TL;DR: A pattern-based model for generic multi-document summarization is presented, which exploits closed patterns to extract the most salient sentences from a document collection while reducing redundancy in the summary.
Posted Content

Lexical Simplification with Pretrained Encoders

TL;DR: Experimental results show that this approach obtains a clear improvement over baselines that leverage linguistic databases and parallel corpora, outperforming the state of the art by more than 12 accuracy points on three well-known benchmarks.