scispace - formally typeset

Showing papers on "Noisy text analytics published in 2015"


Proceedings ArticleDOI
23 Aug 2015
TL;DR: Initial results are presented on the use of Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNN) for recognizing lines of handwritten Chinese text without explicit segmentation of the characters.
Abstract: We present initial results on the use of Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNN) in recognizing lines of handwritten Chinese text without explicit segmentation of the characters. In fact, most Chinese text recognizers in the literature perform a pre-segmentation of the text image into characters. This can be a drawback, as explicit segmentation is an extra step before recognizing the text, and errors made at this stage have a direct impact on the performance of the whole system. MDLSTM-RNN is now a state-of-the-art technology that provides the best performance on languages with Latin and Arabic characters, hence we propose to apply it to Chinese text recognition. Our results on the data from Task 4 of the ICDAR 2013 competition for handwritten Chinese recognition are comparable in performance with the best reported systems.

146 citations


Journal Article
TL;DR: In this article, the authors focus on classifying Arabic text documents using a maximum entropy method; they evaluate their approach on real data and compare the results with other existing systems.
Abstract: In organizations, a large amount of information exists in text documents, so it is important to use text mining to discover knowledge from these unstructured data. Automatic text classification is considered one of the important applications of text mining: the process of assigning a text document to one or more predefined categories based on its content. This paper focuses on classifying Arabic text documents. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In our approach, we first preprocess the data using natural language processing techniques such as tokenization, stemming, and part-of-speech tagging. Then, we use the maximum entropy method to classify Arabic documents. We evaluated our approach on real data and compared the results with other existing systems.

140 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work uses lexical-semantic knowledge provided by a well-known semantic network for short text understanding, applying knowledge-intensive approaches that focus on semantics in tasks such as text segmentation, part-of-speech tagging, and concept labeling.
Abstract: Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.

138 citations


Journal ArticleDOI
TL;DR: Substantial experiments show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods in text mining.
Abstract: It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences, because of the large scale of terms and data patterns. Most existing popular text mining and classification methods have adopted term-based approaches; however, they all suffer from the problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-based methods should perform better than term-based ones in describing user preferences; yet how to effectively use large-scale patterns remains a hard problem in text mining. To make a breakthrough on this challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics, and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods.

67 citations


Proceedings ArticleDOI
23 Aug 2015
TL;DR: A new method to use more “agnostic” Machine Learning-based approaches to address text line location, inspired by the latest generation of optical models used for text recognition, namely Recurrent Neural Networks.
Abstract: The detection of text lines, as a first processing step, is critical in all text recognition systems. State-of-the-art methods to locate lines of text are based on handcrafted heuristics fine-tuned by the image processing community's experience. They succeed under certain constraints; for instance, the background has to be roughly uniform. We propose to use more "agnostic" Machine Learning-based approaches to address text line location. The main motivation is to be able to process either damaged documents, or flows of documents with a high variety of layouts and other characteristics. A new method is presented in this work, inspired by the latest generation of optical models used for text recognition, namely Recurrent Neural Networks. As these models are sequential, a column of text lines in our application plays the same role as a line of characters in more traditional text recognition settings. A key advantage of the proposed method over other data-driven approaches is that compiling a training dataset does not require labeling line boundaries: only the number of lines is required for each paragraph. Experimental results show that our approach gives similar or better results than traditional handcrafted approaches, with little engineering effort and less hyper-parameter tuning.

54 citations


15 Mar 2015
TL;DR: Various types of linguistic features in users' reviews are focused on, such as the number of pronouns; psychological features, such as affective processes; current concerns, such as degree of leisure; spoken features, such as degree of assent; and punctuation, such as the number of colons.
Abstract: The recent development of Web 2.0 has generated a wealth of user-created content. Among the various types of user-generated data on the web, reviews of businesses, products, or services written by users are becoming more and more important due to the word-of-mouth effect and their impact on consumers' purchase decisions. However, with the increasing popularity of online review websites such as TripAdvisor and Yelp, malicious users have started to abuse the convenience of publishing online reviews and deliberately post low-quality, untrustworthy, or even fraudulent reviews. Such "spam reviews" can yield significant financial gains for organizations and individuals while negatively impacting their competitors. For example, a few recent studies have reported a new category of business that hires people to write positive reviews for companies in order to attract users' attention and increase profits. Spam reviews undoubtedly reduce the quality of reviews, and they may even mislead users into making wrong purchase decisions. Therefore, there is a great demand for detecting spam reviews thoroughly on the web. Recently, several studies have investigated machine learning techniques to automatically construct spam classification models based on specific features (Ott, Choi, Cardie, & Hancock, 2011; Mihalcea & Strapparava, 2009). For example, Ott et al. (2011) investigated lexical features such as the frequency of verbs used in the reviews. Their results indicated that such lexical features are useful for building classification models for spam reviews. Despite the usefulness of many lexical features, the classification performance of existing models for spam review detection is still far from satisfactory. An interesting direction to explore is whether other features, such as users' sentiments and feelings, as well as many other linguistic features of reviews, could be incorporated into the classification model.
In this paper, we focus on various types of linguistic features in users’ reviews such as the number of pronouns, psychological features such as the affective processes, current concerns such as degree of leisure, spoken features such as degree of assent, and punctuation such as number of colons. We evaluated the spam classification performance by considering more than 40 different classification algorithms on a spam review benchmark dataset. Our experimental results verified that the combination of linguistic features with some others (e.g., the frequency of words) could improve the detection performance over the state-of-the-art method, reaching more than 93% accuracy.
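A minimal sketch of extracting surface features of the kind described above (pronoun counts, assent-like words, punctuation such as colons); the word lists, feature names, and example review below are hypothetical stand-ins, not the study's actual feature inventory, which is far richer:

```python
import re

# Toy word lists; a real system would use LIWC-style category lexicons.
PRONOUNS = {"i", "we", "you", "he", "she", "it", "they", "me", "my", "our"}
ASSENT = {"yes", "ok", "agree", "absolutely"}

def linguistic_features(review):
    """Count a few surface features of a review: pronouns,
    assent-like words, colons, and total word count."""
    toks = re.findall(r"[a-z']+", review.lower())
    return {
        "n_pronouns": sum(t in PRONOUNS for t in toks),
        "n_assent": sum(t in ASSENT for t in toks),
        "n_colons": review.count(":"),
        "n_words": len(toks),
    }

f = linguistic_features("We absolutely loved it: great staff, great view!")
print(f)
```

Feature dictionaries like this would then be fed, alongside word-frequency features, into the classification algorithms the paper compares.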

40 citations


Journal ArticleDOI
TL;DR: It is shown how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation.
Abstract: Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data—in which documents are represented by rows and words by columns—provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages. WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361 For further resources related to this article, please visit the WIREs website.
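The SVD-based reduction described above can be sketched in a few lines; the tiny corpus and raw term counts are hypothetical stand-ins for a real document-term matrix, which would normally be built with weighting and stop-word handling:

```python
import numpy as np

# Hypothetical four-document corpus; a real DTM would come from a vectorizer.
docs = ["cats chase mice", "dogs chase cats", "stocks rose sharply", "markets rose today"]
vocab = sorted({w for d in docs for w in d.split()})
dtm = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Truncated SVD: keep k singular vectors to get a low-rank document space.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
k = 2
doc_factors = U[:, :k] * s[:k]   # documents projected into k latent dimensions

print(doc_factors.shape)  # (4, 2)
```

In the reduced space, the two animal documents land near each other and far from the two finance documents, which is the property LSA exploits for clustering and topic extraction.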

39 citations


Journal ArticleDOI
TL;DR: This paper explored text characteristics specifically in relation to early-grades text complexity and found that interplay among text characteristics was important to explanation of text complexity, particularly for subsets of texts.
Abstract: [Correction notice (see record 2015-17975-001): Figures 5 and 8 were inadvertently printed in greyscale through a production-related error; the correct color figures appear in this record.] The Common Core set a standard for all children to read increasingly complex texts throughout schooling. The purpose of the present study was to explore text characteristics specifically in relation to early-grades text complexity. Three hundred fifty primary-grades texts were selected and digitized. Twenty-two text characteristics were identified at 4 linguistic levels, and multiple computerized operationalizations were created for each of the 22 text characteristics. A researcher-devised text-complexity outcome measure was based on teacher judgment of text complexity in the 350 texts as well as on student judgment of text complexity as gauged by their responses in a maze task for a subset of the 350 texts. Analyses were conducted using a logical analytical progression typically used in machine-learning research. Random forest regression was the primary statistical modeling technique. Nine text characteristics were most important for early-grades text complexity, including word structure (decoding demand and number of syllables in words), word meaning (age of acquisition, abstractness, and word rareness), and sentence- and discourse-level characteristics (intersentential complexity, phrase diversity, text density/information load, and noncompressibility). Notably, interplay among text characteristics was important to the explanation of text complexity, particularly for subsets of texts.

37 citations


Proceedings ArticleDOI
23 Feb 2015
TL;DR: Wikipedia articles are given as input to system and extractive text summarization is presented by identifying text features and scoring the sentences accordingly by using the citations present in the text and identifying synonyms.
Abstract: The main objective of a text summarization system is to identify the most important information in a given text and present it to the end user. In this paper, Wikipedia articles are given as input to the system, and extractive text summarization is performed by identifying text features and scoring the sentences accordingly. The text is first pre-processed to tokenize the sentences and perform stemming. We then score the sentences using the different text features. Two novel features are the citations present in the text and the identification of synonyms; these, along with the traditional features, are used to score the sentences. The scores are used to decide, with the help of a neural network, whether a sentence belongs in the summary. The user can specify what percentage of the original text should appear in the summary. We find that scoring sentences based on citations gives the best results.
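A toy sketch of the sentence-scoring idea, using only word frequency as the feature; the paper combines several features (including citations and synonyms) and a neural network, none of which are reproduced here:

```python
import re
from collections import Counter

def summarize(text, ratio=0.5):
    """Toy extractive summarizer: score each sentence by the average
    corpus frequency of its words, keep the top fraction in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    n_keep = max(1, round(len(sentences) * ratio))
    keep = sorted(sorted(sentences, key=score, reverse=True)[:n_keep],
                  key=sentences.index)
    return " ".join(keep)

text = ("Text mining extracts knowledge from documents. "
        "Mining documents reveals knowledge. "
        "The weather was pleasant yesterday.")
print(summarize(text, ratio=2/3))
```

The off-topic third sentence scores lowest and is dropped; a real system would replace the single frequency score with a learned combination of features.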

34 citations


Proceedings ArticleDOI
18 Jun 2015
TL;DR: Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents.
Abstract: In today's world, the amount of stored information is increasing enormously day by day. Much of it is unstructured and cannot be processed directly to extract useful information, so techniques such as summarization, classification, clustering, information extraction, and visualization are applied; these fall under the category of text mining. Text mining, also known as text data mining or knowledge discovery from textual databases, is the process of extracting interesting and non-trivial patterns or knowledge from text documents. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value.

29 citations


Proceedings ArticleDOI
19 Mar 2015
TL;DR: The objective of this paper is to recognize text from images, for better understanding by the reader, using a particular sequence of processing modules.
Abstract: Text recognition in images is a research area that attempts to develop computer systems able to automatically read text from images. These days there is a huge demand for storing the information available in paper documents on a computer storage disk and later reusing it through a search process. One simple way to store information from these paper documents in a computer system is to first scan the documents and then store them as images. But to reuse this information it is very difficult to read the individual contents and to search the contents of these documents line by line and word by word. The challenges involved include the font characteristics of the characters in paper documents and the quality of the images. Due to these challenges, a computer is unable to recognize the characters while reading them. Thus there is a need for character recognition mechanisms to perform Document Image Analysis (DIA), which transforms documents from paper format to electronic format. In this paper we discuss a method for text recognition from images. The objective is to recognize text from images, for better understanding by the reader, using a particular sequence of processing modules.

Patent
24 Nov 2015
TL;DR: In this article, a system and a method for classifying text messages such as social media messages into sentiment valence categories are provided, comprising a module for decomposing text messages.
Abstract: A system and a method for classifying text messages, such as social media messages, into sentiment valence categories are provided. The system comprises a module for decomposing text messages, a module for cleaning text messages, a module for producing feature data of text messages, and a module for classifying text messages into sentiment valence categories. The module for decomposing text messages is configured to: receive a text message; parse the text message into separate portions in response to parsing criteria based on sentence delimiters, wherein the separate portions are sentences, phrases, and words; and rejoin at least some of the separate portions of the text message into sentences in response to predefined linguistic conditions.
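A rough sketch of the decompose-then-rejoin step claimed above; the sentence delimiters and the "predefined linguistic conditions" (here, a tiny abbreviation list) are hypothetical simplifications of whatever the patent actually specifies:

```python
import re

ABBREVIATIONS = {"e.g.", "i.e.", "dr."}  # hypothetical rejoin conditions

def decompose(message):
    """Split a message on sentence delimiters, then rejoin fragments
    whose predecessor ends with a known abbreviation."""
    parts = re.split(r"(?<=[.!?])\s+", message.strip())
    sentences = []
    for part in parts:
        if sentences and sentences[-1].lower().endswith(tuple(ABBREVIATIONS)):
            sentences[-1] += " " + part   # rejoin: the split was spurious
        else:
            sentences.append(part)
    return sentences

msg = "Great food, e.g. the pasta. Service was slow! Would return."
print(decompose(msg))
```

The naive split produces four fragments; the rejoin step merges the fragment broken at "e.g." back into its sentence, leaving three.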

Proceedings ArticleDOI
01 Dec 2015
TL;DR: This article explores the role of text cleaning on the 20 Newsgroups dataset, reports experimental results, and argues that text cleaning techniques are a key mechanism in typical text mining application frameworks.
Abstract: The rapid increase in the number of text documents available on the Internet has created pressure to use effective cleaning techniques. Cleaning techniques are needed for converting these documents to structured documents. Text cleaning techniques are one of the key mechanisms in typical text mining application frameworks. In this paper, we explore the role of text cleaning in the 20 newsgroups dataset, and report on experimental results.
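A generic cleaning function of the kind such studies evaluate; the specific steps and regular expressions below are illustrative assumptions, not the paper's actual pipeline:

```python
import re

def clean(text):
    """Toy cleaning pipeline: strip e-mail-style header lines,
    markup remnants, and punctuation noise, then normalize whitespace."""
    text = re.sub(r"^(From|Subject|Organization|Lines):.*$", "", text, flags=re.M)
    text = re.sub(r"<[^>]+>", " ", text)          # markup remnants
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # punctuation / noise
    return re.sub(r"\s+", " ", text).strip().lower()

raw = "From: a@b.com\nSubject: test\n<p>Hello, WORLD!!</p>"
print(clean(raw))  # "hello world"
```

Converting raw posts into normalized tokens like this is the step whose effect on downstream classification the paper measures.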

Journal ArticleDOI
TL;DR: This paper targets the not-in-vocabulary (NIV) words present in these sources and proposes a method to identify and normalize them, replacing internet slang with standard English and correcting spelling errors to some extent.

Proceedings ArticleDOI
09 Jul 2015
TL;DR: The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliteration of words from another language, in this case Bangla.
Abstract: This paper addresses the problem of text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliteration of words from another language, in this case Bangla. The targeted research problem also entails solving another problem, that of word-level language identification in code-mixed social media text. We employ a CRF based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we used the noisy channel model of spelling correction. In addition, the spell checker model presented here tackles wordplay, contracted words and phonetic variations. Overall, the word-level language identification achieved 90.5% accuracy and the spell checker achieved 69.43% accuracy on the detected English words.
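The noisy channel model used above for spelling correction can be sketched with the classic candidate-generation approach and a uniform error model; the word counts below are toy stand-ins for a real language model:

```python
from collections import Counter

# Hypothetical unigram counts standing in for a language model P(w);
# a real system would estimate these from a large corpus.
WORD_COUNTS = Counter({"the": 500, "they": 120, "then": 80, "than": 60, "hello": 30})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Noisy channel with a uniform error model: among known candidates
    within one edit, pick the one with the highest language-model count."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS] or [word]
    return max(candidates, key=WORD_COUNTS.__getitem__)

print(correct("thw"))  # "the" is the only known one-edit candidate
```

The paper's spell checker additionally handles wordplay, contractions, and phonetic variation, which this uniform error model does not attempt.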

Proceedings ArticleDOI
01 Dec 2015
TL;DR: In this study, an Android application is developed by integrating the Tesseract OCR engine, the Bing translator, and the phone's built-in speech-out technology, helping travelers who visit a foreign country to understand messages displayed in a different language.
Abstract: Smartphones are among the most commonly used electronic devices in daily life today. As the hardware embedded in smartphones can perform many more tasks than traditional phones, a smartphone is no longer just a communication device but a powerful computing device able to capture images, record videos, surf the internet, and so on. With the advancement of technology, it is possible to apply techniques for text detection and translation. Therefore, an application that allows a smartphone to capture an image, extract the text from it, translate it into English, and speak it aloud is no longer a dream. In this study, an Android application is developed by integrating the Tesseract OCR engine, the Bing translator, and the phone's built-in speech-out technology. The final deliverable was tested by various target end users from different language backgrounds, and we conclude that the application benefits many users. Using this app, travelers visiting a foreign country can understand messages displayed in a different language. Visually impaired users can also access important messages from printed text through the speech-out feature.

Proceedings ArticleDOI
01 Oct 2015
TL;DR: The objective of this review paper is to summarize the well-known methods for text recognition from images for better understanding of the reader.
Abstract: Text recognition in images is an active research area that attempts to develop computer applications with the ability to automatically read text from images. Nowadays there is a huge demand for storing the information available in paper documents in a computer-readable form for later use. One simple way to store information from these paper documents in a computer system is to first scan the documents and then store them as images. However, to reuse this information it is very difficult to read the individual contents and to search the contents of these documents line by line and word by word. The challenges involved are the font characteristics of the characters in paper documents and the quality of the images. Due to these challenges, a computer is unable to recognize the characters while reading them. Thus, there is a need for character recognition mechanisms to perform document image analysis, which transforms documents from paper format to electronic format. In this paper, we review and analyze different methods for text recognition from images. The objective of this review is to summarize the well-known methods for the reader's better understanding.

Proceedings ArticleDOI
08 Oct 2015
TL;DR: A novel methodology for selecting the optimal set of domain specific stop words for improved text mining accuracy by retaining all the stop words in the text preprocessing phase is proposed.
Abstract: Eliminating all stop words from the feature space is a standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may result in a loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain-specific stop words for improved text mining accuracy. First, the presented methodology retains all the stop words in the text preprocessing phase. Then, an evolutionary technique is used to extract the optimal set of stop words that results in the best classification accuracy. The presented methodology was implemented on a corpus of open-source news articles related to critical infrastructure hazards. The first step of mining geo-dependencies among critical infrastructures from text is text classification. To achieve this, article content was classified into two classes: 1) text content with geo-location information, and 2) text content without geo-location information. The classification accuracy of the presented methodology was compared to the accuracies of four other test cases. Experimental results with 10-fold cross validation showed that the presented method yielded an increase of 1.76% or more in the True Positive (TP) rate and of 2.27% or more in the True Negative (TN) rate compared to the other techniques.
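A sketch of the search idea: evaluate candidate stop-word removal sets against downstream accuracy and keep the best. The paper uses an evolutionary technique; the greedy loop and toy scoring function below are simplifications for illustration only:

```python
# Toy sketch: the "classifier evaluation" is a stand-in scoring function
# that pretends location cues like "near"/"at" help geo-classification,
# so removing them hurts accuracy. All names and numbers are hypothetical.

CANDIDATE_STOPWORDS = ["the", "a", "in", "near", "at"]

def accuracy_with(removed):
    """Hypothetical downstream accuracy when `removed` words are stripped."""
    score = 0.80
    score += 0.02 * len([w for w in removed if w in ("the", "a", "in")])
    score -= 0.05 * len([w for w in removed if w in ("near", "at")])
    return score

def greedy_select(candidates):
    """Greedy stand-in for the evolutionary search: add a word to the
    removal set only if doing so does not lower accuracy."""
    removed = []
    for w in candidates:
        if accuracy_with(removed + [w]) >= accuracy_with(removed):
            removed.append(w)
    return removed

best = greedy_select(CANDIDATE_STOPWORDS)
print(best)  # ['the', 'a', 'in'] under this toy objective
```

The point mirrored from the paper is that generic stop words survive removal while domain-informative "stop" words (here the location cues) are kept in the feature space.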

Proceedings ArticleDOI
24 Oct 2015
TL;DR: A novel text classification scheme which learns some data sets and correctly classifies unstructured text data into two different categories, True and False is proposed.
Abstract: Recently, due to the large-scale spread of data in the digital economy, the era of big data is coming. With big data, unstructured text data consisting of technical documents, confidential documents, and false-information documents is experiencing serious leakage problems. To prevent this, the need for techniques to sort and process documents consisting of text data has increased. In this paper, we propose a novel text classification scheme which learns from some data sets and correctly classifies unstructured text data into two different categories, True and False. The proposed method is implemented using a Naive Bayes document classifier and TF-IDF.
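A minimal multinomial Naive Bayes classifier over two categories, as a sketch of the proposed scheme; it uses plain word counts rather than the TF-IDF weighting the paper pairs it with, and the training snippets are invented:

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, y in zip(texts, labels):
            toks = re.findall(r"[a-z]+", text.lower())
            self.word_counts[y].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, text):
        toks = re.findall(r"[a-z]+", text.lower())
        n_total = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y, n_docs in self.class_counts.items():
            lp = math.log(n_docs / n_total)  # log prior
            total = sum(self.word_counts[y].values())
            for t in toks:                    # smoothed log likelihoods
                lp += math.log((self.word_counts[y][t] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

clf = NaiveBayesText().fit(
    ["official report with verified figures", "audited verified statement",
     "shocking secret miracle cure", "unbelievable secret trick"],
    ["True", "True", "False", "False"])
print(clf.predict("verified audited figures"))  # classified as "True"
```

Swapping the raw counts for TF-IDF weights, as the paper does, changes only how the per-word evidence is scaled, not the decision rule.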

Journal ArticleDOI
TL;DR: Six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, and Learning from Unlabeled Text; these data-centric directions are likely to influence future research in Natural Language Processing.
Abstract: In recent years, Text Mining has seen a tremendous spurt of growth as data scientists focus their attention on analyzing unstructured data. The main drivers for this growth have been big data as well as complex applications where the information in the text is often combined with other kinds of information in building predictive models. These applications require highly efficient and scalable algorithms to meet the overall performance demands. In this context, six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, Learning from Unlabeled Text. Each direction has its own motivations and goals. There is some overlap of concepts because of the common themes of text and prediction. The predictive models involved are typically ones that involve meta-information or tags that could be added to the text. These tags can then be used in other text processing tasks such as information extraction. While the boundary between the fields of Text Mining and Natural Language Processing is becoming increasingly blurry, the importance of predictive models for various applications involving text means there is still substantial growth potential within the traditional sub-fields of text mining. These data-centric directions are also likely to influence future research in Natural Language Processing, especially in resource-poor languages and in multilingual texts. WIREs Data Mining Knowl Discov 2015, 5:155-164. doi: 10.1002/widm.1154

Proceedings ArticleDOI
21 Oct 2015
TL;DR: An automatic text-based segmentation algorithm is developed to identify topic changes and evaluated on a set of twenty-five lecture videos; the key conclusions are that screen text is a better guide to discovering topic changes than speech text, that the effectiveness of speech text improves significantly when the speech text is corrected, and that combining screen text with accurate speech text can improve accuracy further.
Abstract: Video of classroom lectures is a valuable and increasingly popular learning resource. A major weakness of the video format is the inability to quickly access the content of interest. The goal of this work is to automatically partition a lecture video into topical segments which are then presented to the user in a customized video player. The approach taken in this work is to identify topics based on text similarities across the video. The paper investigates the use of screen text extracted by Optical Character Recognition tools, as well as the speech text extracted by Automatic Speech Recognition tools. An automatic text-based segmentation algorithm is developed to identify topic changes and evaluated on a set of twenty-five lecture videos. The key conclusions are as follows: screen text is a better guide to discovering topic changes than speech text; the effectiveness of speech text can be improved significantly by correcting the speech text; and combining screen text with accurate speech text can improve accuracy further. Results are presented from surveys showing a high level of satisfaction among student users of automatically segmented videos. The paper also discusses the limits of automatic segmentation and the reasons why it is far from perfect.
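The text-similarity segmentation idea can be sketched as a TextTiling-style pass: mark a topic boundary wherever the bag-of-words similarity between adjacent text blocks drops. The slide texts and threshold below are hypothetical, not the paper's algorithm:

```python
import re
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = (sum(v * v for v in c1.values()) ** 0.5) * \
          (sum(v * v for v in c2.values()) ** 0.5)
    return num / den if den else 0.0

def topic_boundaries(blocks, threshold=0.1):
    """Mark a boundary between adjacent blocks whose similarity
    falls below the threshold; returns boundary indices."""
    bags = [Counter(re.findall(r"[a-z]+", b.lower())) for b in blocks]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]

slides = ["sorting algorithms quicksort merge",
          "quicksort partition sorting pivot",
          "graph traversal breadth first",
          "graph shortest path traversal"]
print(topic_boundaries(slides))  # one boundary, before block index 2
```

The same pass applies whether the blocks come from OCR'd screen text or ASR speech text; noisy ASR mainly degrades the bags, which is why the paper finds correction helps.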

Proceedings ArticleDOI
08 Sep 2015
TL;DR: A novel processing pipeline for multi-oriented text extraction from infographics that applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and uses a state-of-the-art open source OCR engine to perform the text recognition.
Abstract: Existing research on analyzing information graphics assume to have a perfect text detection and extraction available. However, text extraction from information graphics is far from solved. To fill this gap, we propose a novel processing pipeline for multi-oriented text extraction from infographics. The pipeline applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and uses a state-of-the-art open source OCR engine to perform the text recognition. We evaluate our method on 121 infographics extracted from an open access corpus of scientific publications. The results show that our approach is effective and significantly outperforms a state-of-the-art baseline.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: In this article, a set of classification rules is derived to explicitly differentiate machine printed and handwritten entries, written in any language, and the proposed approach is independent of language, style, size, and fonts that commonly co-exist in multilingual data entry forms.
Abstract: Handwriting in data entry forms/documents usually indicates user-filled information that should be treated differently from printed text. In the Arab world, this filled-in information is normally in English or Arabic. Moreover, classification approaches are quite different for machine print and script, so prior to segmentation and classification, distinguishing text into printed and script entries is mandatory. In this research, the dilemma of language-independent text distinction in multilingual data entry forms is addressed. Our main focus is to distinguish machine-printed text and script in multilingual data entry forms in a language-independent way. The proposed approach explores new statistical and structural features of text lines to classify them into separate categories. Accordingly, a set of classification rules is derived to explicitly differentiate machine-printed and handwritten entries written in any language. An additional novelty of the proposed approach is that no training phase or training data is required; rather, text is discriminated on the basis of simple rules. Promising experimental results with 90% accuracy show that the proposed approach is simple and robust. Finally, the scheme is independent of the languages, styles, sizes, and fonts that commonly co-exist in multilingual data entry forms.
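One statistical text-line rule in the spirit of the approach above: machine print tends to have nearly uniform character heights, while handwriting varies. The normalized-variance threshold below is a hypothetical stand-in, not one of the paper's actual rules:

```python
def classify_line(heights):
    """Toy rule: classify a text line from its character heights.
    Low normalized height variance suggests machine print; the 0.01
    threshold is a hypothetical choice for illustration."""
    mean = sum(heights) / len(heights)
    var = sum((h - mean) ** 2 for h in heights) / len(heights)
    return "printed" if var / (mean ** 2) < 0.01 else "handwritten"

print(classify_line([20, 20, 21, 20]))   # uniform heights -> "printed"
print(classify_line([18, 25, 14, 22]))   # varied heights  -> "handwritten"
```

Because the rule looks only at geometry, not glyph identity, it is language-independent in the same sense the paper claims for its rule set.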

Proceedings ArticleDOI
25 May 2015
TL;DR: This paper proposes a new multilingual stemmer based on the extraction of the word root using the technique of n-grams, validated on three languages: Arabic, French and English.
Abstract: Stemming is a technique used to reduce inflected and derived words to their basic forms (stem or root). It is a very important pre-processing step in text mining, and is widely used in many research areas such as Natural Language Processing (NLP), Text Categorization (TC), Text Summarization (TS), Information Retrieval (IR), and other text mining tasks. Stemming is frequently useful in text categorization to reduce the size of the term vocabulary, and in information retrieval to improve search effectiveness and return more relevant results. In this paper, we propose a new multilingual stemmer based on the extraction of the word root, using the technique of n-grams. We validated our stemmer on three languages: Arabic, French and English.
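The n-gram idea behind such stemmers can be sketched briefly. This is a generic illustration, not the paper's exact algorithm: two words are treated as variants of the same root when the Dice similarity of their character-bigram sets exceeds a threshold (the 0.6 threshold is an assumption).

```python
# Sketch: character n-gram (bigram) similarity for conflating word variants.
def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a, b, n=2):
    # Dice coefficient over the two words' n-gram sets
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def same_root(a, b, threshold=0.6):
    return dice(a, b) >= threshold

print(round(dice("connection", "connected"), 2))   # 0.75
print(same_root("connection", "connected"))        # True
print(same_root("connection", "translation"))      # False
```

Because it operates on raw character sequences, the same similarity measure applies unchanged to Arabic, French or English words, which is what makes the n-gram approach naturally multilingual.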

Proceedings ArticleDOI
01 Nov 2015
TL;DR: This work proposes using different density-based features to distinguish between "relevant" and "unwanted" (or noisy) parts of writing, and uses a 2-class HMM-based classifier to obtain an encouraging detection rate for unwanted regions in online handwritten text.
Abstract: Noise detection in online handwritten text is an important task in data acquisition. Noise occurs in online handwritten text in various ways: crossing out previously written text due to misspelling, repeated writing of the same stroke several times along a slightly different trajectory, and simply writing corrections over other text are all very common. Detection of these unwanted regions is a crucial pre-processing step in automatic text recognition. Currently, detection and removal or correction of such regions is often done manually after collecting the data; particularly for large databases, this can become a tedious and costly procedure. Consequently, in this work we focus on noise detection for database creation. We propose using different density-based features to distinguish between "relevant" and "unwanted" (or noisy) parts of writing. Using a 2-class HMM-based classifier, we obtain an encouraging detection rate for unwanted regions in online handwritten text.
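One plausible density-based feature can be sketched as follows. The paper's actual feature set is not specified in the abstract, so this is an assumption for illustration: crossed-out or overwritten regions pack many sampled pen points into a small bounding box, so point density can help separate "relevant" from "unwanted" writing.

```python
# Sketch of a density feature over an online handwriting stroke:
# number of sampled pen points per unit of bounding-box area.
def point_density(stroke):
    # stroke: list of (x, y) pen positions sampled by the tablet
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    area = max(max(xs) - min(xs), 1) * max(max(ys) - min(ys), 1)
    return len(stroke) / area

scribble = [(x % 5, x % 3) for x in range(60)]   # dense back-and-forth crossing-out
clean = [(x, x // 2) for x in range(60)]         # spread-out, ordinary writing
print(point_density(scribble) > point_density(clean))  # True
```

In the paper's setting, features like this (computed over sliding windows of the pen trajectory) would feed the 2-class HMM that labels each region as relevant or unwanted.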

Posted Content
TL;DR: In this article, an Ontology-based SMS Controller is proposed which analyzes a text message and classifies it, using an ontology, as legitimate or spam. The proposed system has been tested on different scenarios, and experimental results show that the proposed solution is effective both in terms of efficiency and time.
Abstract: Text analysis includes lexical analysis of the text and has been widely studied and used in diverse applications. In the last decade, researchers have proposed many efficient solutions to analyze and classify large text datasets; however, analysis and classification of short text is still a challenge because 1) the data is very sparse, 2) it contains noise words, and 3) it is difficult to understand the syntactical structure of the text. Short Messaging Service (SMS) is a text messaging service for mobile/smart phones, and this service is frequently used by all mobile users. Because of the popularity of the SMS service, marketing companies nowadays are also using it for direct marketing, also known as SMS spam. In this paper, we propose an Ontology-based SMS Controller which analyzes a text message and classifies it, using an ontology, as legitimate or spam. The proposed system has been tested on different scenarios, and experimental results show that the proposed solution is effective both in terms of efficiency and time.
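The ontology-driven matching can be illustrated with a toy sketch. The paper's actual ontology is not reproduced here; the concept names, term lists, and the two-concept threshold below are all illustrative assumptions: spam concepts and their lexical variants are stored as a small concept-to-terms map, and a message is flagged as spam when it mentions enough distinct spam concepts.

```python
# Toy sketch: a flat "ontology" of spam concepts and their surface terms,
# used to classify a short SMS as spam or legitimate.
import re

SPAM_ONTOLOGY = {
    "prize":   {"win", "winner", "prize", "award"},
    "urgency": {"now", "urgent", "immediately", "today"},
    "money":   {"cash", "free", "offer"},
}

def classify_sms(message, min_concepts=2):
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    # count how many distinct spam concepts the message touches
    hits = sum(1 for terms in SPAM_ONTOLOGY.values() if tokens & terms)
    return "spam" if hits >= min_concepts else "legitimate"

print(classify_sms("You are a winner! Claim your free prize now"))  # spam
print(classify_sms("Meeting moved to 3pm, see you there"))          # legitimate
```

Counting distinct concepts rather than raw keyword hits is one way such a system copes with the sparsity of short text noted in the abstract.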

Proceedings ArticleDOI
10 Aug 2015
TL;DR: The rising need for automation of systems has driven the development of text detection and recognition from images to a large extent; this paper surveys these techniques and attempts to answer key questions in chosen scenarios.
Abstract: The rising need for automation of systems has driven the development of text detection and recognition from images to a large extent. Text recognition has a wide range of applications, each with scenario-dependent challenges and complications. How can these challenges be mitigated? What image processing techniques can be applied to make the text in an image machine-readable? How can text be localized and separated from non-textual information? How can a text image be converted to digital text format? This paper attempts to answer these questions in chosen scenarios. The types of document images that we have surveyed include general documents such as newspapers, books and magazines, forms, scientific documents, unconstrained documents such as maps and architectural and engineering drawings, and scene images with textual information.

Patent
28 Sep 2015
TL;DR: In this paper, a first classifier is trained using features of the training data and then a second classifier was trained using the similar examples, which were used to label the additional unstructured text for domain relevance.
Abstract: Retrieving unstructured text related to a specified domain from the Internet is described. Training data is accessed; the training data comprises unstructured text related to the specified domain. A first classifier is trained using features of the training data. It is used to classify unstructured text having a plurality of features, to obtain unstructured text examples related to the domain. The unstructured text examples are used to retrieve from the Internet similar examples which do not have at least some of the plurality of features. Optionally, a second classifier is trained using the similar examples. Additional unstructured text is retrieved from the Internet, and the second classifier is used to label the additional unstructured text for domain relevance.
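The two-stage bootstrapping idea can be sketched in a minimal, stdlib-only form. The patent gives no implementation, so everything below (the bag-of-words scorer, the 0.5 cutoff, the toy corpora) is an assumption chosen only to show the workflow: a first classifier trained on seed texts labels candidate text, and its confident positives then extend the training data for a second classifier.

```python
# Minimal sketch of a two-stage (bootstrapped) domain classifier.
from collections import Counter

def train(texts):
    # bag-of-words "model": word counts over the training texts
    model = Counter()
    for t in texts:
        model.update(t.lower().split())
    return model

def score(model, text):
    # fraction of the text's words that appear in the model's vocabulary
    words = text.lower().split()
    return sum(1 for w in words if w in model) / max(len(words), 1)

seed = ["stock market prices rose", "shares fell on trading news"]
first = train(seed)

candidates = ["market shares rose sharply", "the cat sat on the mat"]
# Stage 1: keep only candidates the first classifier scores highly.
positives = [t for t in candidates if score(first, t) > 0.5]
# Stage 2: retrain on the seed data plus the newly labelled positives.
second = train(seed + positives)

print(positives)
print(score(second, "trading prices fell") > score(second, "cat sat"))  # True
```

The point of the second stage is that it absorbs vocabulary (here, "sharply") absent from the original seed features, mirroring the patent's retrieval of similar examples that lack some of the first classifier's features.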

Proceedings ArticleDOI
03 Dec 2015
TL;DR: Several requirements apply to the preprocessing of texts for classification; within the frame of this work, the importance of these requirements is analysed.
Abstract: There are several requirements for the preprocessing of texts to be classified. Within the frame of this work, the importance of these requirements has been analysed.

01 Jan 2015
TL;DR: This paper proposes mathematical notation and graphical models for Text Mining, Text Categorization and Automatic Text Classification to provide an in-depth understanding of these techniques and concepts, and to shorten the response time of text and information retrieval.
Abstract: As the time goes on and on, digitization of text has been increasing enormously and the need to organize, categorize and classify text has become indispensable. Disorganization and very little categorization and classification of text may result in slower response time of text or information retrieval. Therefore it is very important and essential to organize, categorize and classify texts and digitized documents according to definitions proposed by text mining experts and computer scientists. Work has been done on Text Mining, Text Categorization and Automatic Text Classification by computer and information scientists, but obviously a lot of space for novel research in this domain is available. In this paper we have proposed the mathematical notation and graphical models for Text Mining, Text Categorization and Automatic Text Classification to get in depth understanding of these techniques and concepts. Introduction and proposal of mathematical and graphical models for Text Mining, Text Categorization and Automatic Text Classification will shorten the response time of text and information retrieval. Also the performance of web search engines can be improved so much by employing these mathematical and graphical models.