scispace - formally typeset

Showing papers on "Noisy text analytics published in 2015"


Proceedings ArticleDOI
23 Aug 2015
TL;DR: Initial results are presented on the use of Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNN) for recognizing lines of handwritten Chinese text without explicit segmentation of the characters.
Abstract: We present initial results on the use of Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNN) in recognizing lines of handwritten Chinese text without explicit segmentation of the characters. In fact, most Chinese text recognizers in the literature perform a pre-segmentation of the text image into characters. This can be a drawback, as explicit segmentation is an extra step before recognizing the text, and errors made at this stage have a direct impact on the performance of the whole system. MDLSTM-RNN is now a state-of-the-art technology that provides the best performance on languages with Latin and Arabic characters, hence we propose to apply it to Chinese text recognition. Our results on the data from Task 4 of the ICDAR 2013 competition for handwritten Chinese recognition are comparable in performance with the best reported systems.

146 citations


Journal Article
TL;DR: In this article, the authors focus on classifying Arabic text documents using a maximum entropy method; they evaluate their approach on real data and compare the results with other existing systems.
Abstract: In organizations, a large amount of information exists in text documents, so it is important to use text mining to discover knowledge from these unstructured data. Automatic text classification is considered one of the important applications of text mining: the process of assigning a text document to one or more predefined categories based on its content. This paper focuses on classifying Arabic text documents. Arabic is a highly inflectional and derivational language, which makes text mining a complex task. In our approach, we first preprocess the data using natural language processing techniques such as tokenization, stemming, and part-of-speech tagging. Then, we use the maximum entropy method to classify Arabic documents. We evaluated our approach on real data and compared the results with other existing systems.

140 citations


Proceedings ArticleDOI
13 Apr 2015
TL;DR: This work uses lexical-semantic knowledge provided by a well-known semantic network for short text understanding, applying knowledge-intensive approaches that focus on semantics in tasks such as text segmentation, part-of-speech tagging, and concept labeling.
Abstract: Understanding short texts is crucial to many applications, but challenges abound. First, short texts do not always observe the syntax of a written language. As a result, traditional natural language processing methods cannot be easily applied. Second, short texts usually do not contain sufficient statistical signals to support many state-of-the-art approaches for text processing such as topic modeling. Third, short texts are usually more ambiguous. We argue that knowledge is needed in order to better understand short texts. In this work, we use lexical-semantic knowledge provided by a well-known semantic network for short text understanding. Our knowledge-intensive approach disrupts traditional methods for tasks such as text segmentation, part-of-speech tagging, and concept labeling, in the sense that we focus on semantics in all these tasks. We conduct a comprehensive performance evaluation on real-life data. The results show that knowledge is indispensable for short text understanding, and our knowledge-intensive approaches are effective in harvesting semantics of short texts.

138 citations


Journal ArticleDOI
TL;DR: Substantial experiments show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods in text mining.
Abstract: It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing user preferences, because of the large scale of terms and data patterns. Most existing popular text mining and classification methods have adopted term-based approaches; however, they all suffer from the problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-based methods should perform better than term-based ones in describing user preferences; yet how to effectively use large-scale patterns remains a hard problem in text mining. To make a breakthrough on this challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both positive and negative patterns in text documents as higher-level features and deploys them over low-level features (terms). It also classifies terms into categories and updates term weights based on their specificity and their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics, and Reuters-21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods and the pattern-based methods.

67 citations


Proceedings ArticleDOI
23 Aug 2015
TL;DR: A new method to use more “agnostic” Machine Learning-based approaches to address text line location, inspired by the latest generation of optical models used for text recognition, namely Recurrent Neural Networks.
Abstract: The detection of text lines, as a first processing step, is critical in all text recognition systems. State-of-the-art methods to locate lines of text are based on handcrafted heuristics fine-tuned by the image processing community's experience. They succeed under certain constraints; for instance, the background has to be roughly uniform. We propose to use more "agnostic" Machine Learning-based approaches to address text line location. The main motivation is to be able to process either damaged documents, or flows of documents with a high variety of layouts and other characteristics. A new method is presented in this work, inspired by the latest generation of optical models used for text recognition, namely Recurrent Neural Networks. As these models are sequential, a column of text lines in our application plays the same role as a line of characters in more traditional text recognition settings. A key advantage of the proposed method over other data-driven approaches is that compiling a training dataset does not require labeling line boundaries: only the number of lines is required for each paragraph. Experimental results show that our approach gives similar or better results than traditional handcrafted approaches, with little engineering effort and less hyper-parameter tuning.

54 citations


15 Mar 2015
TL;DR: Various types of linguistic features in users' reviews are focused on, such as the number of pronouns; psychological features, such as affective processes; current concerns, such as degree of leisure; spoken features, such as degree of assent; and punctuation, such as the number of colons.
Abstract: The recent development of Web 2.0 has generated a wealth of user-created content. Among the various types of user-generated data on the web, reviews of businesses, products, or services written by users are becoming more and more important due to the word-of-mouth effect and their impact on consumers' purchase decisions. However, with the increasing popularity of online review websites such as TripAdvisor and Yelp, malicious users have started to abuse the convenience of publishing online reviews and deliberately post low-quality, untrustworthy, or even fraudulent reviews. Such "spam reviews" can yield significant financial gains for organizations and individuals while negatively impacting their competitors. For example, a few recent studies have reported a new category of business that hires people to write positive reviews for companies in order to attract users' attention and increase profits. Spam reviews undoubtedly reduce the quality of reviews, and they may even mislead users into making wrong purchase decisions. Therefore, there is a great demand for detecting spam reviews thoroughly on the web. Recently, several studies have investigated machine learning techniques to automatically construct spam classification models based on specific features (Ott, Choi, Cardie, & Hancock, 2011; Mihalcea & Strapparava, 2009). For example, Ott et al. (2011) investigated lexical features such as the frequency of verbs used in the reviews. Their results indicated that such lexical features are useful for building classification models for spam reviews. Despite the usefulness of many lexical features, the classification performance of existing models for spam review detection is still far from satisfactory. An interesting direction to explore is whether other features, such as users' sentiments and feelings, as well as many other linguistic features of reviews, could be incorporated into the classification model.
In this paper, we focus on various types of linguistic features in users’ reviews such as the number of pronouns, psychological features such as the affective processes, current concerns such as degree of leisure, spoken features such as degree of assent, and punctuation such as number of colons. We evaluated the spam classification performance by considering more than 40 different classification algorithms on a spam review benchmark dataset. Our experimental results verified that the combination of linguistic features with some others (e.g., the frequency of words) could improve the detection performance over the state-of-the-art method, reaching more than 93% accuracy.
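A minimal sketch of extracting surface features of the kind described above (pronoun counts, assent-like words, punctuation such as colons); the word lists, feature names, and example review below are hypothetical stand-ins, not the study's actual feature inventory, which is far richer:

```python
import re

# Toy word lists; a real system would use LIWC-style category lexicons.
PRONOUNS = {"i", "we", "you", "he", "she", "it", "they", "me", "my", "our"}
ASSENT = {"yes", "ok", "agree", "absolutely"}

def linguistic_features(review):
    """Count a few surface features of a review: pronouns,
    assent-like words, colons, and total word count."""
    toks = re.findall(r"[a-z']+", review.lower())
    return {
        "n_pronouns": sum(t in PRONOUNS for t in toks),
        "n_assent": sum(t in ASSENT for t in toks),
        "n_colons": review.count(":"),
        "n_words": len(toks),
    }

f = linguistic_features("We absolutely loved it: great staff, great view!")
print(f)
```

Feature dictionaries like this would then be fed, alongside word-frequency features, into the classification algorithms the paper compares.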

40 citations


Journal ArticleDOI
TL;DR: It is shown how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation.
Abstract: Text analytics continue to proliferate as mass volumes of unstructured but highly useful data are generated at unbounded rates. Vector space models for text data—in which documents are represented by rows and words by columns—provide a translation of this unstructured data into a format that may be analyzed with statistical and machine learning techniques. This approach gives excellent results in revealing common themes, clustering documents, clustering words, and in translating unstructured text fields (such as an open-ended survey response) to usable input variables for predictive modeling. After discussing the collection and processing of text, we explore properties and transformations of the document-term matrix (DTM). We show how the singular value decomposition may be used to drastically reduce the size of the document space while also setting the stage for automatic topic extraction, courtesy of the varimax rotation. This latent semantic analysis (LSA) approach produces factors that are compatible with graphical exploration and advanced analytics. We also explore Latent Dirichlet Allocation for topic analysis. We reference published R packages to implement the methods and conclude with a summary of other popular open-source and commercial software packages. WIREs Comput Stat 2015, 7:326–340. doi: 10.1002/wics.1361 For further resources related to this article, please visit the WIREs website.
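The SVD-based reduction described above can be sketched in a few lines; the tiny corpus and raw term counts are hypothetical stand-ins for a real document-term matrix, which would normally be built with weighting and stop-word handling:

```python
import numpy as np

# Hypothetical four-document corpus; a real DTM would come from a vectorizer.
docs = ["cats chase mice", "dogs chase cats", "stocks rose sharply", "markets rose today"]
vocab = sorted({w for d in docs for w in d.split()})
dtm = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

# Truncated SVD: keep k singular vectors to get a low-rank document space.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)
k = 2
doc_factors = U[:, :k] * s[:k]   # documents projected into k latent dimensions

print(doc_factors.shape)  # (4, 2)
```

In the reduced space, the two animal documents land near each other and far from the two finance documents, which is the property LSA exploits for clustering and topic extraction.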

39 citations


Journal ArticleDOI
TL;DR: This paper explored text characteristics specifically in relation to early-grades text complexity and found that interplay among text characteristics was important to explanation of text complexity, particularly for subsets of texts.
Abstract: [Correction notice (see record 2015-17975-001): Figures 5 and 8 were inadvertently printed in greyscale through a production-related error; the correct color figures appear in this record.] The Common Core set a standard for all children to read increasingly complex texts throughout schooling. The purpose of the present study was to explore text characteristics specifically in relation to early-grades text complexity. Three hundred fifty primary-grades texts were selected and digitized. Twenty-two text characteristics were identified at 4 linguistic levels, and multiple computerized operationalizations were created for each of the 22 text characteristics. A researcher-devised text-complexity outcome measure was based on teacher judgment of text complexity in the 350 texts as well as on student judgment of text complexity as gauged by their responses in a maze task for a subset of the 350 texts. Analyses were conducted using a logical analytical progression typically used in machine-learning research. Random forest regression was the primary statistical modeling technique. Nine text characteristics were most important for early-grades text complexity, including word structure (decoding demand and number of syllables in words), word meaning (age of acquisition, abstractness, and word rareness), and sentence- and discourse-level characteristics (intersentential complexity, phrase diversity, text density/information load, and noncompressibility). Notably, interplay among text characteristics was important to the explanation of text complexity, particularly for subsets of texts.

37 citations


Proceedings ArticleDOI
23 Feb 2015
TL;DR: Wikipedia articles are given as input to system and extractive text summarization is presented by identifying text features and scoring the sentences accordingly by using the citations present in the text and identifying synonyms.
Abstract: The main objective of a text summarization system is to identify the most important information in a given text and present it to the end user. In this paper, Wikipedia articles are given as input to the system, and extractive text summarization is performed by identifying text features and scoring the sentences accordingly. The text is first pre-processed to tokenize the sentences and perform stemming. We then score the sentences using the different text features. Two novel features are the citations present in the text and the identification of synonyms; these, along with the traditional features, are used to score the sentences. The scores are used to decide, with the help of a neural network, whether a sentence belongs in the summary. The user can specify what percentage of the original text should appear in the summary. We find that scoring sentences based on citations gives the best results.
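A toy sketch of the sentence-scoring idea, using only word frequency as the feature; the paper combines several features (including citations and synonyms) and a neural network, none of which are reproduced here:

```python
import re
from collections import Counter

def summarize(text, ratio=0.5):
    """Toy extractive summarizer: score each sentence by the average
    corpus frequency of its words, keep the top fraction in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))
    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    n_keep = max(1, round(len(sentences) * ratio))
    keep = sorted(sorted(sentences, key=score, reverse=True)[:n_keep],
                  key=sentences.index)
    return " ".join(keep)

text = ("Text mining extracts knowledge from documents. "
        "Mining documents reveals knowledge. "
        "The weather was pleasant yesterday.")
print(summarize(text, ratio=2/3))
```

The off-topic third sentence scores lowest and is dropped; a real system would replace the single frequency score with a learned combination of features.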

34 citations


Proceedings ArticleDOI
18 Jun 2015
TL;DR: Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents.
Abstract: In today's world, the amount of stored information is increasing enormously day by day. Much of it is unstructured and cannot be processed directly to extract useful information, so techniques such as summarization, classification, clustering, information extraction, and visualization are applied; these fall under the category of text mining. Text mining, also known as text data mining or knowledge discovery from textual databases, is the process of extracting interesting and non-trivial patterns or knowledge from text documents. Regarded by many as the next wave of knowledge discovery, text mining has very high commercial value.

29 citations


Proceedings ArticleDOI
19 Mar 2015
TL;DR: The objective of this paper is to recognize text from images, for better understanding by the reader, using a particular sequence of processing modules.
Abstract: Text recognition in images is a research area that attempts to develop computer systems able to automatically read text from images. These days there is a huge demand for storing the information available in paper documents on a computer storage disk and later reusing it through a search process. One simple way to store information from these paper documents in a computer system is to first scan the documents and then store them as images. But to reuse this information it is very difficult to read the individual contents and to search the contents of these documents line by line and word by word. The challenges involved include the font characteristics of the characters in paper documents and the quality of the images. Due to these challenges, a computer is unable to recognize the characters while reading them. Thus there is a need for character recognition mechanisms to perform Document Image Analysis (DIA), which transforms documents from paper format to electronic format. In this paper we discuss a method for text recognition from images. The objective is to recognize text from images, for better understanding by the reader, using a particular sequence of processing modules.

Patent
24 Nov 2015
TL;DR: In this article, a system and a method for classifying text messages such as social media messages into sentiment valence categories are provided, comprising a module for decomposing text messages.
Abstract: A system and a method for classifying text messages, such as social media messages, into sentiment valence categories are provided. The system comprises a module for decomposing text messages, a module for cleaning text messages, a module for producing feature data of text messages, and a module for classifying text messages into sentiment valence categories. The module for decomposing text messages is configured to: receive a text message; parse the text message into separate portions in response to parsing criteria based on sentence delimiters, wherein the separate portions are sentences, phrases, and words; and rejoin at least some of the separate portions of the text message into sentences in response to predefined linguistic conditions.
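A rough sketch of the decompose-then-rejoin step claimed above; the sentence delimiters and the "predefined linguistic conditions" (here, a tiny abbreviation list) are hypothetical simplifications of whatever the patent actually specifies:

```python
import re

ABBREVIATIONS = {"e.g.", "i.e.", "dr."}  # hypothetical rejoin conditions

def decompose(message):
    """Split a message on sentence delimiters, then rejoin fragments
    whose predecessor ends with a known abbreviation."""
    parts = re.split(r"(?<=[.!?])\s+", message.strip())
    sentences = []
    for part in parts:
        if sentences and sentences[-1].lower().endswith(tuple(ABBREVIATIONS)):
            sentences[-1] += " " + part   # rejoin: the split was spurious
        else:
            sentences.append(part)
    return sentences

msg = "Great food, e.g. the pasta. Service was slow! Would return."
print(decompose(msg))
```

The naive split produces four fragments; the rejoin step merges the fragment broken at "e.g." back into its sentence, leaving three.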

Proceedings ArticleDOI
01 Dec 2015
TL;DR: This article explores the role of text cleaning on the 20 Newsgroups dataset, reports experimental results, and argues that text cleaning techniques are a key mechanism in typical text mining application frameworks.
Abstract: The rapid increase in the number of text documents available on the Internet has created pressure to use effective cleaning techniques. Cleaning techniques are needed for converting these documents to structured documents. Text cleaning techniques are one of the key mechanisms in typical text mining application frameworks. In this paper, we explore the role of text cleaning in the 20 newsgroups dataset, and report on experimental results.
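A generic cleaning function of the kind such studies evaluate; the specific steps and regular expressions below are illustrative assumptions, not the paper's actual pipeline:

```python
import re

def clean(text):
    """Toy cleaning pipeline: strip e-mail-style header lines,
    markup remnants, and punctuation noise, then normalize whitespace."""
    text = re.sub(r"^(From|Subject|Organization|Lines):.*$", "", text, flags=re.M)
    text = re.sub(r"<[^>]+>", " ", text)          # markup remnants
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # punctuation / noise
    return re.sub(r"\s+", " ", text).strip().lower()

raw = "From: a@b.com\nSubject: test\n<p>Hello, WORLD!!</p>"
print(clean(raw))  # "hello world"
```

Converting raw posts into normalized tokens like this is the step whose effect on downstream classification the paper measures.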

Journal ArticleDOI
TL;DR: This paper targets the not-in-vocabulary (NIV) words present in these sources and proposes a method to identify and normalize them, replacing internet slang with standard English and correcting spelling errors to some extent.

Proceedings ArticleDOI
09 Jul 2015
TL;DR: The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliteration of words from another language, in this case Bangla.
Abstract: This paper addresses the problem of text normalization, an often overlooked problem in natural language processing, in code-mixed social media text. The objective of the work presented here is to correct English spelling errors in code-mixed social media text that contains English words as well as Romanized transliteration of words from another language, in this case Bangla. The targeted research problem also entails solving another problem, that of word-level language identification in code-mixed social media text. We employ a CRF based machine learning approach followed by post-processing heuristics for the word-level language identification task. For spelling correction, we used the noisy channel model of spelling correction. In addition, the spell checker model presented here tackles wordplay, contracted words and phonetic variations. Overall, the word-level language identification achieved 90.5% accuracy and the spell checker achieved 69.43% accuracy on the detected English words.
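The noisy channel model used above for spelling correction can be sketched with the classic candidate-generation approach and a uniform error model; the word counts below are toy stand-ins for a real language model:

```python
from collections import Counter

# Hypothetical unigram counts standing in for a language model P(w);
# a real system would estimate these from a large corpus.
WORD_COUNTS = Counter({"the": 500, "they": 120, "then": 80, "than": 60, "hello": 30})
LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit away (delete, transpose, replace, insert)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Noisy channel with a uniform error model: among known candidates
    within one edit, pick the one with the highest language-model count."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS] or [word]
    return max(candidates, key=WORD_COUNTS.__getitem__)

print(correct("thw"))  # "the" is the only known one-edit candidate
```

The paper's spell checker additionally handles wordplay, contractions, and phonetic variation, which this uniform error model does not attempt.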

Proceedings ArticleDOI
01 Dec 2015
TL;DR: In this study, an Android application is developed by integrating the Tesseract OCR engine, the Bing translator, and the phone's built-in speech-out technology, helping travelers who visit a foreign country to understand messages displayed in a different language.
Abstract: Smartphones are among the most commonly used electronic devices in daily life today. As the hardware embedded in smartphones can perform many more tasks than traditional phones, a smartphone is no longer just a communication device but a powerful computing device able to capture images, record videos, surf the internet, and so on. With the advancement of technology, it is possible to apply techniques for text detection and translation. Therefore, an application that allows a smartphone to capture an image, extract the text from it, translate it into English, and speak it aloud is no longer a dream. In this study, an Android application is developed by integrating the Tesseract OCR engine, the Bing translator, and the phone's built-in speech-out technology. The final deliverable was tested by various target end users from different language backgrounds, and we conclude that the application benefits many users. Using this app, travelers visiting a foreign country can understand messages displayed in a different language. Visually impaired users can also access important messages from printed text through the speech-out feature.

Proceedings ArticleDOI
01 Oct 2015
TL;DR: The objective of this review paper is to summarize the well-known methods for text recognition from images for better understanding of the reader.
Abstract: Text recognition in images is an active research area that attempts to develop computer applications with the ability to automatically read text from images. Nowadays there is a huge demand for storing the information available in paper documents in a computer-readable form for later use. One simple way to store information from these paper documents in a computer system is to first scan the documents and then store them as images. However, to reuse this information it is very difficult to read the individual contents and to search the contents of these documents line by line and word by word. The challenges involved are the font characteristics of the characters in paper documents and the quality of the images. Due to these challenges, a computer is unable to recognize the characters while reading them. Thus, there is a need for character recognition mechanisms to perform document image analysis, which transforms documents from paper format to electronic format. In this paper, we review and analyze different methods for text recognition from images. The objective of this review is to summarize the well-known methods for the reader's better understanding.

Proceedings ArticleDOI
08 Oct 2015
TL;DR: A novel methodology for selecting the optimal set of domain specific stop words for improved text mining accuracy by retaining all the stop words in the text preprocessing phase is proposed.
Abstract: Eliminating all stop words from the feature space is a standard preprocessing practice in text mining, regardless of the domain to which it is applied. However, this may result in a loss of important information, which adversely affects the accuracy of the text mining algorithm. Therefore, this paper proposes a novel methodology for selecting the optimal set of domain-specific stop words for improved text mining accuracy. First, the presented methodology retains all the stop words in the text preprocessing phase. Then, an evolutionary technique is used to extract the optimal set of stop words that results in the best classification accuracy. The presented methodology was implemented on a corpus of open-source news articles related to critical infrastructure hazards. The first step of mining geo-dependencies among critical infrastructures from text is text classification. To achieve this, article content was classified into two classes: 1) text content with geo-location information, and 2) text content without geo-location information. The classification accuracy of the presented methodology was compared to the accuracies of four other test cases. Experimental results with 10-fold cross validation showed that the presented method yielded an increase of 1.76% or more in the True Positive (TP) rate and of 2.27% or more in the True Negative (TN) rate compared to the other techniques.
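A sketch of the search idea: evaluate candidate stop-word removal sets against downstream accuracy and keep the best. The paper uses an evolutionary technique; the greedy loop and toy scoring function below are simplifications for illustration only:

```python
# Toy sketch: the "classifier evaluation" is a stand-in scoring function
# that pretends location cues like "near"/"at" help geo-classification,
# so removing them hurts accuracy. All names and numbers are hypothetical.

CANDIDATE_STOPWORDS = ["the", "a", "in", "near", "at"]

def accuracy_with(removed):
    """Hypothetical downstream accuracy when `removed` words are stripped."""
    score = 0.80
    score += 0.02 * len([w for w in removed if w in ("the", "a", "in")])
    score -= 0.05 * len([w for w in removed if w in ("near", "at")])
    return score

def greedy_select(candidates):
    """Greedy stand-in for the evolutionary search: add a word to the
    removal set only if doing so does not lower accuracy."""
    removed = []
    for w in candidates:
        if accuracy_with(removed + [w]) >= accuracy_with(removed):
            removed.append(w)
    return removed

best = greedy_select(CANDIDATE_STOPWORDS)
print(best)  # ['the', 'a', 'in'] under this toy objective
```

The point mirrored from the paper is that generic stop words survive removal while domain-informative "stop" words (here the location cues) are kept in the feature space.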

Proceedings ArticleDOI
24 Oct 2015
TL;DR: A novel text classification scheme which learns some data sets and correctly classifies unstructured text data into two different categories, True and False is proposed.
Abstract: Recently, due to the large-scale spread of data in the digital economy, the era of big data is coming. With big data, unstructured text data consisting of technical documents, confidential documents, and false-information documents is experiencing serious leakage problems. To prevent this, the need for techniques to sort and process documents consisting of text data has increased. In this paper, we propose a novel text classification scheme which learns from some data sets and correctly classifies unstructured text data into two different categories, True and False. The proposed method is implemented using a Naive Bayes document classifier and TF-IDF.
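A minimal multinomial Naive Bayes classifier over two categories, as a sketch of the proposed scheme; it uses plain word counts rather than the TF-IDF weighting the paper pairs it with, and the training snippets are invented:

```python
import math
import re
from collections import Counter, defaultdict

class NaiveBayesText:
    """Minimal multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, y in zip(texts, labels):
            toks = re.findall(r"[a-z]+", text.lower())
            self.word_counts[y].update(toks)
            self.vocab.update(toks)
        return self

    def predict(self, text):
        toks = re.findall(r"[a-z]+", text.lower())
        n_total = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_lp = None, float("-inf")
        for y, n_docs in self.class_counts.items():
            lp = math.log(n_docs / n_total)  # log prior
            total = sum(self.word_counts[y].values())
            for t in toks:                    # smoothed log likelihoods
                lp += math.log((self.word_counts[y][t] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = y, lp
        return best

clf = NaiveBayesText().fit(
    ["official report with verified figures", "audited verified statement",
     "shocking secret miracle cure", "unbelievable secret trick"],
    ["True", "True", "False", "False"])
print(clf.predict("verified audited figures"))  # classified as "True"
```

Swapping the raw counts for TF-IDF weights, as the paper does, changes only how the per-word evidence is scaled, not the decision rule.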

Journal ArticleDOI
TL;DR: Six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, and Learning from Unlabeled Text; these data-centric directions are likely to influence future research in Natural Language Processing.
Abstract: In recent years, Text Mining has seen a tremendous spurt of growth as data scientists focus their attention on analyzing unstructured data. The main drivers for this growth have been big data as well as complex applications where the information in the text is often combined with other kinds of information in building predictive models. These applications require highly efficient and scalable algorithms to meet the overall performance demands. In this context, six main directions are identified where research in text mining is heading: Deep Learning, Topic Models, Graphical Modeling, Summarization, Sentiment Analysis, Learning from Unlabeled Text. Each direction has its own motivations and goals. There is some overlap of concepts because of the common themes of text and prediction. The predictive models involved are typically ones that involve meta-information or tags that could be added to the text. These tags can then be used in other text processing tasks such as information extraction. While the boundary between the fields of Text Mining and Natural Language Processing is becoming increasingly blurry, the importance of predictive models for various applications involving text means there is still substantial growth potential within the traditional sub-fields of text mining. These data-centric directions are also likely to influence future research in Natural Language Processing, especially in resource-poor languages and in multilingual texts. WIREs Data Mining Knowl Discov 2015, 5:155-164. doi: 10.1002/widm.1154

Proceedings ArticleDOI
21 Oct 2015
TL;DR: An automatic text-based segmentation algorithm is developed to identify topic changes and evaluated on a set of twenty-five lecture videos; the key conclusions are that screen text is a better guide to discovering topic changes than speech text, that the effectiveness of speech text improves significantly when the speech text is corrected, and that combining screen text with accurate speech text can improve accuracy further.
Abstract: Video of classroom lectures is a valuable and increasingly popular learning resource. A major weakness of the video format is the inability to quickly access the content of interest. The goal of this work is to automatically partition a lecture video into topical segments which are then presented to the user in a customized video player. The approach taken in this work is to identify topics based on text similarities across the video. The paper investigates the use of screen text extracted by Optical Character Recognition tools, as well as the speech text extracted by Automatic Speech Recognition tools. An automatic text-based segmentation algorithm is developed to identify topic changes and evaluated on a set of twenty-five lecture videos. The key conclusions are as follows: screen text is a better guide to discovering topic changes than speech text; the effectiveness of speech text can be improved significantly by correcting the speech text; and combining screen text with accurate speech text can improve accuracy further. Results are presented from surveys showing a high level of satisfaction among student users of automatically segmented videos. The paper also discusses the limits of automatic segmentation and the reasons why it is far from perfect.
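The text-similarity segmentation idea can be sketched as a TextTiling-style pass: mark a topic boundary wherever the bag-of-words similarity between adjacent text blocks drops. The slide texts and threshold below are hypothetical, not the paper's algorithm:

```python
import re
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two bag-of-words Counters."""
    common = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in common)
    den = (sum(v * v for v in c1.values()) ** 0.5) * \
          (sum(v * v for v in c2.values()) ** 0.5)
    return num / den if den else 0.0

def topic_boundaries(blocks, threshold=0.1):
    """Mark a boundary between adjacent blocks whose similarity
    falls below the threshold; returns boundary indices."""
    bags = [Counter(re.findall(r"[a-z]+", b.lower())) for b in blocks]
    return [i + 1 for i in range(len(bags) - 1)
            if cosine(bags[i], bags[i + 1]) < threshold]

slides = ["sorting algorithms quicksort merge",
          "quicksort partition sorting pivot",
          "graph traversal breadth first",
          "graph shortest path traversal"]
print(topic_boundaries(slides))  # one boundary, before block index 2
```

The same pass applies whether the blocks come from OCR'd screen text or ASR speech text; noisy ASR mainly degrades the bags, which is why the paper finds correction helps.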

Proceedings ArticleDOI
08 Sep 2015
TL;DR: A novel processing pipeline for multi-oriented text extraction from infographics that applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and uses a state-of-the-art open source OCR engine to perform the text recognition.
Abstract: Existing research on analyzing information graphics assume to have a perfect text detection and extraction available. However, text extraction from information graphics is far from solved. To fill this gap, we propose a novel processing pipeline for multi-oriented text extraction from infographics. The pipeline applies a combination of data mining and computer vision techniques to identify text elements, cluster them into text lines, compute their orientation, and uses a state-of-the-art open source OCR engine to perform the text recognition. We evaluate our method on 121 infographics extracted from an open access corpus of scientific publications. The results show that our approach is effective and significantly outperforms a state-of-the-art baseline.

Proceedings ArticleDOI
01 Dec 2015
TL;DR: In this article, a set of classification rules is derived to explicitly differentiate machine printed and handwritten entries, written in any language, and the proposed approach is independent of language, style, size, and fonts that commonly co-exist in multilingual data entry forms.
Abstract: Handwriting in data entry forms/documents usually indicates user-filled information that should be treated differently from printed text. In the Arab world, this filled-in information is normally in English or Arabic. Moreover, classification approaches are quite different for machine print and script, so prior to segmentation and classification, distinguishing text into printed and script entries is mandatory. In this research, the dilemma of language-independent text distinction in multilingual data entry forms is addressed. Our main focus is to distinguish machine-printed text and script in multilingual data entry forms in a language-independent way. The proposed approach explores new statistical and structural features of text lines to classify them into separate categories. Accordingly, a set of classification rules is derived to explicitly differentiate machine-printed and handwritten entries written in any language. An additional novelty of the proposed approach is that no training phase or training data is required; rather, text is discriminated on the basis of simple rules. Promising experimental results with 90% accuracy show that the proposed approach is simple and robust. Finally, the scheme is independent of the languages, styles, sizes, and fonts that commonly co-exist in multilingual data entry forms.
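One statistical text-line rule in the spirit of the approach above: machine print tends to have nearly uniform character heights, while handwriting varies. The normalized-variance threshold below is a hypothetical stand-in, not one of the paper's actual rules:

```python
def classify_line(heights):
    """Toy rule: classify a text line from its character heights.
    Low normalized height variance suggests machine print; the 0.01
    threshold is a hypothetical choice for illustration."""
    mean = sum(heights) / len(heights)
    var = sum((h - mean) ** 2 for h in heights) / len(heights)
    return "printed" if var / (mean ** 2) < 0.01 else "handwritten"

print(classify_line([20, 20, 21, 20]))   # uniform heights -> "printed"
print(classify_line([18, 25, 14, 22]))   # varied heights  -> "handwritten"
```

Because the rule looks only at geometry, not glyph identity, it is language-independent in the same sense the paper claims for its rule set.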

Proceedings ArticleDOI
25 May 2015
TL;DR: This paper proposes a new multilingual stemmer based on the extraction of the word root using the technique of n-grams, validated on three languages: Arabic, French and English.
Abstract: Stemming is a technique used to reduce inflected and derived words to their basic forms (stem or root). It is a very important pre-processing step in text mining, and is widely used in many research areas such as Natural Language Processing (NLP), Text Categorization (TC), Text Summarization (TS), Information Retrieval (IR), and other text mining tasks. Stemming is frequently useful in text categorization to reduce the size of the term vocabulary, and in information retrieval to improve search effectiveness and return more relevant results. In this paper, we propose a new multilingual stemmer based on the extraction of the word root, using the technique of n-grams. We validated our stemmer on three languages: Arabic, French and English.
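The n-gram idea behind such stemmers can be sketched briefly. This is a generic illustration, not the paper's exact algorithm: two words are treated as variants of the same root when the Dice similarity of their character-bigram sets exceeds a threshold (the 0.6 threshold is an assumption).

```python
# Sketch: character n-gram (bigram) similarity for conflating word variants.
def ngrams(word, n=2):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def dice(a, b, n=2):
    # Dice coefficient over the two words' n-gram sets
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def same_root(a, b, threshold=0.6):
    return dice(a, b) >= threshold

print(round(dice("connection", "connected"), 2))   # 0.75
print(same_root("connection", "connected"))        # True
print(same_root("connection", "translation"))      # False
```

Because it operates on raw character sequences, the same similarity measure applies unchanged to Arabic, French or English words, which is what makes the n-gram approach naturally multilingual.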

Proceedings ArticleDOI
01 Nov 2015
TL;DR: This work proposes using different density-based features to distinguish between "relevant" and "unwanted" (or noisy) parts of writing, and uses a 2-class HMM-based classifier to obtain an encouraging detection rate for unwanted regions in online handwritten text.
Abstract: Noise detection in online handwritten text is an important task in data acquisition. Noise occurs in online handwritten text in various ways: crossing out previously written text due to misspelling, repeated writing of the same stroke several times along a slightly different trajectory, and simply writing corrections over other text are all very common. Detection of these unwanted regions is a crucial pre-processing step in automatic text recognition. Currently, detection and removal or correction of such regions is often done manually after collecting the data; particularly for large databases, this can become a tedious and costly procedure. Consequently, in this work we focus on noise detection for database creation. We propose using different density-based features to distinguish between "relevant" and "unwanted" (or noisy) parts of writing. Using a 2-class HMM-based classifier, we obtain an encouraging detection rate for unwanted regions in online handwritten text.
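One plausible density-based feature can be sketched as follows. The paper's actual feature set is not specified in the abstract, so this is an assumption for illustration: crossed-out or overwritten regions pack many sampled pen points into a small bounding box, so point density can help separate "relevant" from "unwanted" writing.

```python
# Sketch of a density feature over an online handwriting stroke:
# number of sampled pen points per unit of bounding-box area.
def point_density(stroke):
    # stroke: list of (x, y) pen positions sampled by the tablet
    xs = [p[0] for p in stroke]
    ys = [p[1] for p in stroke]
    area = max(max(xs) - min(xs), 1) * max(max(ys) - min(ys), 1)
    return len(stroke) / area

scribble = [(x % 5, x % 3) for x in range(60)]   # dense back-and-forth crossing-out
clean = [(x, x // 2) for x in range(60)]         # spread-out, ordinary writing
print(point_density(scribble) > point_density(clean))  # True
```

In the paper's setting, features like this (computed over sliding windows of the pen trajectory) would feed the 2-class HMM that labels each region as relevant or unwanted.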

Posted Content
TL;DR: In this article, an Ontology-based SMS Controller is proposed which analyzes a text message and classifies it, using an ontology, as legitimate or spam. The proposed system has been tested on different scenarios, and experimental results show that the proposed solution is effective both in terms of efficiency and time.
Abstract: Text analysis includes lexical analysis of the text and has been widely studied and used in diverse applications. In the last decade, researchers have proposed many efficient solutions to analyze and classify large text datasets; however, analysis and classification of short text is still a challenge because 1) the data is very sparse, 2) it contains noise words, and 3) it is difficult to understand the syntactical structure of the text. Short Messaging Service (SMS) is a text messaging service for mobile/smart phones, and this service is frequently used by all mobile users. Because of the popularity of the SMS service, marketing companies nowadays are also using it for direct marketing, also known as SMS spam. In this paper, we propose an Ontology-based SMS Controller which analyzes a text message and classifies it, using an ontology, as legitimate or spam. The proposed system has been tested on different scenarios, and experimental results show that the proposed solution is effective both in terms of efficiency and time.
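The ontology-driven matching can be illustrated with a toy sketch. The paper's actual ontology is not reproduced here; the concept names, term lists, and the two-concept threshold below are all illustrative assumptions: spam concepts and their lexical variants are stored as a small concept-to-terms map, and a message is flagged as spam when it mentions enough distinct spam concepts.

```python
# Toy sketch: a flat "ontology" of spam concepts and their surface terms,
# used to classify a short SMS as spam or legitimate.
import re

SPAM_ONTOLOGY = {
    "prize":   {"win", "winner", "prize", "award"},
    "urgency": {"now", "urgent", "immediately", "today"},
    "money":   {"cash", "free", "offer"},
}

def classify_sms(message, min_concepts=2):
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    # count how many distinct spam concepts the message touches
    hits = sum(1 for terms in SPAM_ONTOLOGY.values() if tokens & terms)
    return "spam" if hits >= min_concepts else "legitimate"

print(classify_sms("You are a winner! Claim your free prize now"))  # spam
print(classify_sms("Meeting moved to 3pm, see you there"))          # legitimate
```

Counting distinct concepts rather than raw keyword hits is one way such a system copes with the sparsity of short text noted in the abstract.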

Proceedings ArticleDOI
10 Aug 2015
TL;DR: The rising need for automation of systems has driven the development of text detection and recognition from images to a large extent; this paper surveys these techniques and attempts to answer key questions in chosen scenarios.
Abstract: The rising need for automation of systems has driven the development of text detection and recognition from images to a large extent. Text recognition has a wide range of applications, each with scenario-dependent challenges and complications. How can these challenges be mitigated? What image processing techniques can be applied to make the text in an image machine-readable? How can text be localized and separated from non-textual information? How can a text image be converted to digital text format? This paper attempts to answer these questions in chosen scenarios. The types of document images that we have surveyed include general documents such as newspapers, books and magazines, forms, scientific documents, unconstrained documents such as maps and architectural and engineering drawings, and scene images with textual information.

Patent
28 Sep 2015
TL;DR: In this paper, a first classifier is trained using features of the training data and then a second classifier was trained using the similar examples, which were used to label the additional unstructured text for domain relevance.
Abstract: Retrieving unstructured text related to a specified domain from the Internet is described. Training data is accessed; the training data comprises unstructured text related to the specified domain. A first classifier is trained using features of the training data. It is used to classify unstructured text having a plurality of features, to obtain unstructured text examples related to the domain. The unstructured text examples are used to retrieve from the Internet similar examples which do not have at least some of the plurality of features. Optionally, a second classifier is trained using the similar examples. Additional unstructured text is retrieved from the Internet, and the second classifier is used to label the additional unstructured text for domain relevance.
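The two-stage bootstrapping idea can be sketched in a minimal, stdlib-only form. The patent gives no implementation, so everything below (the bag-of-words scorer, the 0.5 cutoff, the toy corpora) is an assumption chosen only to show the workflow: a first classifier trained on seed texts labels candidate text, and its confident positives then extend the training data for a second classifier.

```python
# Minimal sketch of a two-stage (bootstrapped) domain classifier.
from collections import Counter

def train(texts):
    # bag-of-words "model": word counts over the training texts
    model = Counter()
    for t in texts:
        model.update(t.lower().split())
    return model

def score(model, text):
    # fraction of the text's words that appear in the model's vocabulary
    words = text.lower().split()
    return sum(1 for w in words if w in model) / max(len(words), 1)

seed = ["stock market prices rose", "shares fell on trading news"]
first = train(seed)

candidates = ["market shares rose sharply", "the cat sat on the mat"]
# Stage 1: keep only candidates the first classifier scores highly.
positives = [t for t in candidates if score(first, t) > 0.5]
# Stage 2: retrain on the seed data plus the newly labelled positives.
second = train(seed + positives)

print(positives)
print(score(second, "trading prices fell") > score(second, "cat sat"))  # True
```

The point of the second stage is that it absorbs vocabulary (here, "sharply") absent from the original seed features, mirroring the patent's retrieval of similar examples that lack some of the first classifier's features.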

Proceedings ArticleDOI
03 Dec 2015
TL;DR: Several requirements apply to the preprocessing of texts for classification; within the frame of this work, the importance of these requirements is analysed.
Abstract: There are several requirements for the preprocessing of texts to be classified. Within the frame of this work, the importance of these requirements has been analysed.

01 Jan 2015
TL;DR: This paper proposes mathematical notation and graphical models for Text Mining, Text Categorization and Automatic Text Classification to provide an in-depth understanding of these techniques and concepts, and to shorten the response time of text and information retrieval.
Abstract: As the time goes on and on, digitization of text has been increasing enormously and the need to organize, categorize and classify text has become indispensable. Disorganization and very little categorization and classification of text may result in slower response time of text or information retrieval. Therefore it is very important and essential to organize, categorize and classify texts and digitized documents according to definitions proposed by text mining experts and computer scientists. Work has been done on Text Mining, Text Categorization and Automatic Text Classification by computer and information scientists, but obviously a lot of space for novel research in this domain is available. In this paper we have proposed the mathematical notation and graphical models for Text Mining, Text Categorization and Automatic Text Classification to get in depth understanding of these techniques and concepts. Introduction and proposal of mathematical and graphical models for Text Mining, Text Categorization and Automatic Text Classification will shorten the response time of text and information retrieval. Also the performance of web search engines can be improved so much by employing these mathematical and graphical models.