
Showing papers on "Noisy text analytics published in 2022"


Proceedings ArticleDOI
01 Jul 2022
TL;DR: In this article, a multi-level secret sharing scheme (MSSS)-based approach is proposed to ensure the security of the most informative and sensitive text data extracted from text documents through text preprocessing steps.
Abstract: Text data pre-processing has become essential in research fields like Information Retrieval (IR), Natural Language Processing (NLP), and text mining. It extracts valuable and nontrivial information from unstructured text data. Text data plays an essential role in the digital era, as various types of information and data are generated in an unstructured way. In this work, a novel approach based on a Multi-level Secret Sharing Scheme (MSSS) is proposed to ensure the security of the most informative and sensitive text data extracted from text documents through text preprocessing steps. The proposed model employs tokenization, stop word removal, and stemming as preprocessing techniques, which help examine text documents. Three types of datasets (electronic mail, WhatsApp messages, and text messages) were used for the experiments, where the model achieves 100% correlation between original and reconstructed text.
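The three preprocessing steps this abstract names (tokenization, stop word removal, stemming) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the stop-word list and the suffix-stripping rules are invented stand-ins (a real system would use, e.g., the Porter stemmer).

```python
import re

# Illustrative stop-word list; the paper does not specify its exact list.
STOP_WORDS = {"the", "a", "an", "is", "are", "in", "of", "and", "to"}

def tokenize(text):
    # Lowercase, then split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    # Toy suffix stripping; a stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(preprocess("The messages are being processed in the pipeline"))
```

The output keeps only stemmed content words, which is the compact representation the secret-sharing step would then operate on.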

3 citations


Proceedings ArticleDOI
23 May 2022
TL;DR: A text merging algorithm is developed that effectively merges the word-level text obtained from the text recognition module to construct line-level and paragraph-level texts, enhancing the semantic context that is crucial to visual text understanding.
Abstract: Text visual question answering (TextVQA) is an important task of visual text understanding, which requires understanding the text produced by a text recognition module and providing correct answers to specific questions. Recent works on TextVQA have tried to combine text recognition and multi-modal learning. However, due to the lack of effective preprocessing of the text recognition output, existing approaches suffer from serious loss of contextual information, which leads to unsatisfactory performance. In this work, we propose a Multi-Modal Learning framework with Text Merging (MML&TM for short) for TextVQA. We develop a text merging (TM) algorithm that effectively merges the word-level text obtained from the text recognition module to construct line-level and paragraph-level texts, enhancing the semantic context that is crucial to visual text understanding. The TM module can be easily incorporated into the multi-modal learning framework to generate more comprehensive answers for TextVQA. We evaluate our method on the public ST-VQA dataset. Experimental results show that our TM algorithm obtains complete semantic information, which subsequently helps MML&TM generate better answers for TextVQA.
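The core idea of merging word-level recognition output into lines can be sketched with plain bounding boxes. This is a hedged illustration, not the paper's TM algorithm: the grouping rule here (vertical-centre proximity to the previously merged word) and the tolerance are assumptions.

```python
# Each word is (text, x, y, w, h) in image coordinates.
def merge_words_into_lines(words, y_tol=0.5):
    lines = []
    for text, x, y, w, h in sorted(words, key=lambda b: b[2]):
        for line in lines:
            # Merge when vertical centres are within y_tol of the box height
            # (compared against the line's most recently added word).
            _, _, ly, _, lh = line[-1]
            if abs((y + h / 2) - (ly + lh / 2)) < y_tol * h:
                line.append((text, x, y, w, h))
                break
        else:
            lines.append([(text, x, y, w, h)])
    # Order words left-to-right within each line and join into line text.
    return [" ".join(t for t, *_ in sorted(line, key=lambda b: b[1]))
            for line in lines]

words = [("world", 60, 10, 50, 12), ("Hello", 5, 11, 50, 12),
         ("Second", 5, 40, 60, 12), ("line", 70, 41, 40, 12)]
print(merge_words_into_lines(words))
```

A paragraph-level merge would apply the same idea one level up, grouping lines by vertical gaps.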

2 citations


Journal ArticleDOI
TL;DR: In this article, Li et al. improve classic bottom-up text detection frameworks by fusing the visual-relational features of text with two effective false positive/negative suppression (FPNS) mechanisms and developing a new shape-approximation strategy.
Abstract: One trend in the latest bottom-up approaches for arbitrary-shape scene text detection is to determine the links between text segments using Graph Convolutional Networks (GCNs). However, the performance of these bottom-up methods is still inferior to that of state-of-the-art top-down methods even with the help of GCNs. We argue that a cause of this is that bottom-up methods fail to make proper use of visual-relational features, which results in accumulated false detections, as well as the error-prone route-finding used for grouping text segments. In this paper, we improve classic bottom-up text detection frameworks by fusing the visual-relational features of text with two effective false positive/negative suppression (FPNS) mechanisms and developing a new shape-approximation strategy. First, dense overlapping text segments depicting the 'characterness' and 'streamline' properties of text are constructed and used in weakly supervised node classification to filter the falsely detected text segments. Then, relational features and visual features of text segments are fused with a novel Location-Aware Transfer (LAT) module and Fuse Decoding (FD) module to jointly rectify the detected text segments. Finally, a novel multiple-text-map-aware contour-approximation strategy is developed based on the rectified text segments, instead of the error-prone route-finding process, to generate the final contour of the detected text. Experiments conducted on five benchmark datasets demonstrate that our method surpasses state-of-the-art performance when embedded in a classic text detection framework, which revitalizes the strengths of bottom-up methods.
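The bottom-up pipeline this abstract describes predicts links between text segments and then groups linked segments into text instances. As a hedged sketch of that grouping step only, here is a union-find grouping over predicted links; the links below are invented stand-ins for GCN link predictions, and this is not the paper's (deliberately avoided) route-finding procedure.

```python
def group_segments(n_segments, links):
    # Union-find: group text segments connected by predicted links.
    parent = list(range(n_segments))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:  # union each linked pair of segments
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n_segments):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Segments 0-1-2 form one text instance, 3-4 another, 5 stands alone.
print(group_segments(6, [(0, 1), (1, 2), (3, 4)]))
```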

1 citation


Proceedings ArticleDOI
20 May 2022
TL;DR: This paper first introduces the development of NLP, then studies NLP model architectures, and builds LSTM and text CNN models on small-sample text data for training, completing text classification and analyzing the experimental results.
Abstract: With the advent of the Internet era, more and more users acquire and send information on the Internet. The data exchanged between users, the data users exchange with the network, and the data users post on platforms (text documents, videos, and images) are all growing exponentially. Among them, the amount of information in text documents has grown significantly, and it is no longer possible to classify and organize them manually. Text classification technology was born to rectify this information disorder. In the era of "big data", it is particularly important to organize this information quickly and accurately, to explore the value of this text information, and to filter out useless information. At the same time, many natural language processing technologies also rely on text classification for data mining. Only with the basic processing of text classification can information be managed, retrieved, filtered, and understood more conveniently. This paper first introduces the development of NLP, then studies NLP model architectures such as word2vec, LSTM, and text CNN, builds LSTM and text CNN models on small-sample text data for training, completes text classification, and finally analyzes the experimental results. From the experimental results, the loss value, accuracy rate, and mean absolute error of the models are calculated to find the best text topic classification model.
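The paper trains LSTM and text CNN models; as a dependency-free sketch of the same train-then-classify workflow (explicitly a stand-in, not the deep models the paper uses), here is a multinomial Naive Bayes text classifier on an invented two-class toy corpus.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    # Per-class word counts and class frequencies from labeled texts.
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    vocab = {w for c in word_counts.values() for w in c}
    best, best_score = None, float("-inf")
    for label, n in class_counts.items():
        # Log prior + log likelihoods with add-one (Laplace) smoothing.
        score = math.log(n / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

data = [("stock market rises", "finance"), ("bank reports profit", "finance"),
        ("team wins final match", "sports"), ("player scores goal", "sports")]
wc, cc = train(data)
print(classify("market profit rises", wc, cc))  # -> finance
```

An LSTM or text CNN replaces the bag-of-words likelihoods with learned sequence or convolutional features, but the surrounding pipeline (train on labeled texts, score a new text per class, pick the best class) is the same.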

1 citation


Journal ArticleDOI
TL;DR: Text data mining, as discussed by the authors, is the process of extracting value from text data and is one way of achieving artificial intelligence.
Abstract: Text data mining, or simply text mining, encompasses tasks that typically analyze vast amounts of digitized text to detect patterns of use and then extract useful information in the search for knowledge; thus, it is one way of achieving artificial intelligence. In other words, text mining is the process of extracting value from text data. Text mining is grounded on data mining, so both fields of data science share many similarities, e.g., in the use of machine learning algorithms. However, data mining usually deals with structured data sets containing numerical data, whereas text mining aims to process unstructured or semi-structured data mainly in the form of text documents. For this reason, pre-processing techniques in text mining focus on identifying and extracting significant features from text data. Moreover, text mining benefits from the advances in natural language processing, particularly when transforming unstructured text into structured data suitable for analysis. With the exponential growth of data in the Internet era, text mining has attracted much attention as part of efforts to reduce the problem of information overload. Indeed, Web mining, which aims to discover and analyze relevant information from heterogeneous data on the Web as in the case of user-generated content from social media, requires significant advances in text mining technologies within a data fusion framework. This article is organized into two main topics: machine learning models and algorithms, which aim to discover knowledge from new data, and text-mining applications, which illustrate various tasks that can extract information from texts.
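The transformation this article emphasizes, turning unstructured text into structured data suitable for analysis, can be sketched as building a document-term matrix. A minimal illustration with an invented two-document corpus and raw term frequencies (real pipelines typically add TF-IDF weighting):

```python
from collections import Counter

docs = ["text mining extracts value from text",
        "data mining deals with structured data"]

# Shared vocabulary across the corpus, in a fixed (sorted) column order.
vocab = sorted({w for d in docs for w in d.split()})

# One term-frequency row per document: the structured representation
# that downstream data-mining algorithms consume.
matrix = [[Counter(d.split())[w] for w in vocab] for d in docs]

print(vocab)
print(matrix)
```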

Posted ContentDOI
19 Dec 2022
TL;DR: This article improves the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text, and develops a topic clustering algorithm that can be used individually or as part of generating the summary to overcome coverage problems.
Abstract: In the past few decades, there has been an explosion in the amount of available data produced from various sources on different topics. The availability of this enormous data necessitates effective computational tools to explore it, which has led to intense and growing interest in the research community in developing computational methods for processing this text data. One line of study focuses on condensing text so that we can reach a higher level of understanding in a shorter time. The two important tasks for this are keyword extraction and text summarization. In keyword extraction, we are interested in finding the key words of a text, which acquaints us with its general topic. In text summarization, we are interested in producing a short text that includes the important information in the document. The TextRank algorithm, an unsupervised learning method that extends PageRank (the algorithm underlying the Google search engine's page ranking), has shown its efficacy in large-scale text mining, especially for text summarization and keyword extraction. This algorithm can automatically extract the important parts of a text (keywords or sentences) and return them as the result. However, it neglects the semantic similarity between the different parts. In this work, we improve the results of the TextRank algorithm by incorporating the semantic similarity between parts of the text. Aside from keyword extraction and text summarization, we develop a topic clustering algorithm based on our framework, which can be used individually or as part of generating the summary to overcome coverage problems.
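Baseline TextRank keyword extraction, the algorithm this paper improves upon, can be sketched as PageRank over a word co-occurrence graph. The window size, damping factor, and iteration count below are conventional defaults, not the paper's settings, and this sketch omits the semantic-similarity weighting that is the paper's contribution.

```python
def textrank_keywords(words, window=2, d=0.85, iters=50):
    # Undirected co-occurrence graph: words within `window` positions link.
    graph = {w: set() for w in words}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # PageRank iteration: a word is important if important words link to it.
    scores = {w: 1.0 for w in graph}
    for _ in range(iters):
        scores = {w: (1 - d) + d * sum(scores[v] / len(graph[v])
                                       for v in graph[w])
                  for w in graph}
    return sorted(scores, key=scores.get, reverse=True)

tokens = "graph ranking assigns scores to graph nodes by ranking links".split()
print(textrank_keywords(tokens)[:3])
```

The paper's variant additionally weights graph edges by semantic similarity between text parts rather than treating all co-occurrence links equally.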

Book ChapterDOI
01 Jan 2022
TL;DR: The authors proposed a text summarization model using NLP techniques that can understand the context of the entire text, identify the most important portions of the text, and generate coherent summaries.
Abstract: With the advancement of technology, text is abundant in today’s world, especially on the web. Therefore, it is important to summarize text so that it becomes easier to read and understand while maintaining the essence and context of the information. Automatic text summarization is an effective way of finding relevant and important information precisely in large texts in a short amount of time with little effort. In this paper, we propose a text summarization model using NLP techniques that can understand the context of the entire text, identify the most important portions of the text, and generate coherent summaries. Keywords: Abstractive text summarization; Natural language processing; Text-to-text transfer transformer architecture

Proceedings ArticleDOI
23 Sep 2022
TL;DR: A survey of research in text classification is presented to create taxonomies, highlighting key findings, directions for future research, and challenges that may be encountered in the field.
Abstract: Text classification organizes documents into predetermined categories, usually with machine learning algorithms. It is a significant way to organize and utilize the large amount of information that exists in unstructured text format. Text classification is an important module in text processing, and its applications are extensive, such as spam filtering, news classification, and part-of-speech tagging, with deep learning continuing to broaden them in recent years. The text also has its own characteristics, and accordingly the general process of text classification is: 1. preprocessing; 2. text representation and feature selection; 3. construction of a classifier; 4. classification, i.e., assigning each text to one or more categories in the TC system. Some researchers have begun to apply deep neural networks to tasks such as the text classification mentioned above. Although research around the task has made great progress, reviews of it are scarce, and a comprehensive review of its development in recent years is lacking. Therefore, we present a survey of research in text classification to create taxonomies. Finally, we give key findings, directions for future research, and challenges that may be encountered in the field.
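The four-step process the survey lists can be sketched end to end. This is a minimal stand-in, not any surveyed system: a nearest-centroid classifier over term-frequency vectors replaces the deep models, feature selection is simply keeping the most frequent terms, and the tiny corpus is invented.

```python
import math
from collections import Counter

def preprocess(text):                      # step 1: preprocessing
    return text.lower().split()

def vectorize(tokens, features):           # step 2: representation
    counts = Counter(tokens)
    return [counts[f] for f in features]

def build_classifier(labeled_docs, k=6):   # step 3: build the classifier
    # Feature selection: keep the k most frequent terms in the corpus.
    all_tokens = [t for text, _ in labeled_docs for t in preprocess(text)]
    features = [w for w, _ in Counter(all_tokens).most_common(k)]
    centroids = {}
    for text, label in labeled_docs:
        vec = vectorize(preprocess(text), features)
        acc = centroids.setdefault(label, [0] * len(features))
        centroids[label] = [a + v for a, v in zip(acc, vec)]
    return features, centroids

def classify(text, features, centroids):   # step 4: classification
    vec = vectorize(preprocess(text), features)
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0
    return max(centroids, key=lambda lab: cos(vec, centroids[lab]))

docs = [("breaking news election results", "news"),
        ("election coverage news update", "news"),
        ("cheap pills buy now offer", "spam"),
        ("limited offer buy cheap now", "spam")]
feats, cents = build_classifier(docs)
print(classify("latest election news", feats, cents))  # -> news
```

The deep approaches the survey covers swap steps 2 and 3 for learned representations and neural classifiers, but keep this overall pipeline.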