Journal ArticleDOI

Vectorization of Text Documents for Identifying Unifiable News Articles

01 Jan 2019-International Journal of Advanced Computer Science and Applications (The Science and Information (SAI) Organization Limited)-Vol. 10, Iss: 7
TL;DR: A framework is introduced for identification of news articles related to top trending topics/hashtags and multi-document summarization of unifiable news articles based on the trending topics for capturing opinion diversity on those topics.
Abstract: Vectorization is imperative for processing textual data in natural language processing applications. Vectorization enables machines to understand textual content by converting it into meaningful numerical representations. The proposed work targets identifying unifiable news articles for performing multi-document summarization. A framework is introduced for identifying news articles related to top trending topics/hashtags and for multi-document summarization of unifiable news articles based on the trending topics, to capture opinion diversity on those topics. Text clustering is applied to the corpus of news articles related to each trending topic to obtain smaller unifiable groups. The effectiveness of various text vectorization methods, namely bag-of-words representations with tf-idf scores, word embeddings, and document embeddings, is investigated for clustering news articles using the k-means algorithm. The paper presents a comparative analysis of the different vectorization methods on documents from the DUC 2004 benchmark dataset in terms of purity.
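As a rough illustration of the pipeline the abstract describes (tf-idf vectorization, k-means clustering, and purity as the evaluation measure), the following sketch uses scikit-learn on a few made-up documents; the corpus, number of clusters, and gold labels are placeholders, not the paper's DUC 2004 setup.

```python
# Minimal sketch: tf-idf vectors, k-means clustering, and purity scoring.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix

docs = [
    "wildfire spreads across the dry hills",
    "firefighters battle the growing wildfire",
    "stock markets rally after the rate cut",
    "investors cheer the central bank decision",
]
gold_topics = [0, 0, 1, 1]  # gold topic labels, needed only to score purity

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Purity: each cluster is credited with its majority gold class.
cm = contingency_matrix(gold_topics, labels)
purity = cm.max(axis=0).sum() / cm.sum()
print(labels, purity)
```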


Citations
Journal ArticleDOI
15 Apr 2020
TL;DR: The experiments show that, depending on the type of content and the metric, the performance of the feature extraction methods differs considerably: some methods are better in certain cases, and the opposite holds in others.
Abstract: This paper analyses the capabilities of different techniques to build a semantic representation of educational digital resources. Educational digital resources are modeled using the Learning Object Metadata (LOM) standard, and these semantic representations can be obtained from different LOM fields, such as the title and description, in order to extract features/characteristics from the digital resources. The feature extraction methods used in this paper are Best Matching 25 (BM25), Latent Semantic Analysis (LSA), Doc2Vec, and Latent Dirichlet Allocation (LDA). The features/descriptors they generate are tested on three types of educational digital resources (scientific publications, learning objects, patents) and a paraphrase corpus, in two use cases: an information retrieval context and an educational recommendation system. For this analysis, unsupervised metrics (two similarity functions and entropy) are used to determine the quality of the features produced by each method. In addition, the paper presents tests of the techniques for paraphrase classification. The experiments show that, depending on the type of content and the metric, the performance of the feature extraction methods differs considerably: some methods are better in certain cases, and the opposite holds in others.
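The feature extractors named above can be approximated with off-the-shelf tooling; the sketch below covers only LSA and LDA via scikit-learn on placeholder documents (BM25 and Doc2Vec would need other libraries, e.g. rank_bm25 and gensim), and is not the cited paper's implementation.

```python
# Illustrative sketch of two of the feature extractors named above (LSA, LDA).
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

docs = ["intro to linear algebra", "matrix factorization tutorial",
        "patent on battery chemistry", "lecture notes on electrochemistry"]

# LSA: truncated SVD over a tf-idf matrix gives dense latent-semantic features.
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_features = lsa.fit_transform(TfidfVectorizer().fit_transform(docs))

# LDA: topic proportions over a term-count matrix serve as features.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_features = lda.fit_transform(CountVectorizer().fit_transform(docs))

print(lsa_features.shape, lda_features.shape)  # (4, 2) (4, 2)
```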

22 citations


Cites background from "Vectorization of Text Documents for..."

  • ...In addition, it is the only one that uses learning resources and patents, and only another one uses a scientific publication dataset in its analysis [6]....


  • ...[6] propose a vectorization approach based on word targets, to identify unifiable news articles....


  • ...Only [6] considers unsupervised metrics, and there are several works that consider document similarity, but none information theory metrics....


Journal ArticleDOI
TL;DR: This paper proposes machine learning models that can successfully detect fake news and applies three models: a random forest classifier, logistic regression, and a term frequency-inverse document frequency (TF-IDF) vectorizer.
Abstract: Before the internet, people acquired their news from the radio, television, and newspapers. With the internet, the news moved online, and suddenly, anyone could post information on websites such as Facebook and Twitter. The spread of fake news has also increased with social media and has become one of the most significant issues of this century. People use fake news to damage the reputation of well-reputed organizations for their own benefit. The main motivation for this work is to build a system that examines, through machine learning, the language patterns that distinguish fake news from real news. This paper proposes machine learning models that can successfully detect fake news. These models identify whether news is real or fake and report the accuracy of that judgment, even in a complex environment. After data preprocessing and exploration, we applied three machine learning models: a random forest classifier, logistic regression, and a term frequency-inverse document frequency (TF-IDF) vectorizer. The accuracy of the TF-IDF vectorizer, logistic regression, random forest classifier, and decision tree classifier models was approximately 99.52%, 98.63%, 99.63%, and 99.68%, respectively. Machine learning models can be considered a great choice to find reality-based results and can be applied to other unstructured data for various sentiment analysis applications.
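A minimal sketch of the TF-IDF plus logistic regression combination the abstract mentions, using scikit-learn; the labeled headlines below are fabricated placeholders, and this is not the cited paper's dataset or tuning.

```python
# Tiny TF-IDF + logistic regression pipeline for real/fake classification.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

headlines = [
    "Scientists confirm water found on the moon",
    "Local council approves new school budget",
    "Celebrity secretly replaced by clone, insiders say",
    "Miracle pill cures every disease overnight",
]
labels = ["real", "real", "fake", "fake"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(headlines, labels)
print(clf.predict(["Clone of mayor spotted at city hall"]))
```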

21 citations

Journal ArticleDOI
TL;DR: This review article presents methods for the automatic detection of crisis-related messages (tweets) on Twitter and compares approaches for solving the detection problem based on filtering by characteristics such as keywords and location, on crowdsourcing, and on machine learning techniques.
Abstract: Messages on social media can be an important source of information during crisis situations. They can frequently provide details about developments much faster than traditional sources (e.g., official news) and can offer personal perspectives on events, such as opinions or specific needs. In the future, these messages can also serve to assess disaster risks. One challenge for utilizing social media in crisis situations is the reliable detection of relevant messages in a flood of data. Researchers have started to look into this problem in recent years, beginning with crowdsourced methods. Lately, approaches have shifted towards an automatic analysis of messages. A major stumbling block here is the question of exactly which messages are considered relevant or informative, as this depends on the specific usage scenario and the role of the user in this scenario. In this review article, we present methods for the automatic detection of crisis-related messages (tweets) on Twitter. We start by showing the varying definitions of importance and relevance relating to disasters, leading into the concept of use-case-dependent actionability that has recently become more popular and is the focal point of this review. This is followed by an overview of existing crisis-related social media data sets for evaluation and training purposes. We then compare approaches for solving the detection problem based (1) on filtering by characteristics like keywords and location, (2) on crowdsourcing, and (3) on machine learning techniques. We analyze the suitability and limitations of these approaches with regard to actionability. We then point out particular challenges, such as the linguistic issues concerning social media data. Finally, we suggest future avenues of research and show connections to related tasks, such as the subsequent semantic classification of tweets.
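As a toy illustration of approach (1) above, filtering by keywords and location, the snippet below applies a keyword list and a bounding box to tweet-like records; all values are illustrative and not drawn from the cited review.

```python
# Simple keyword + location filter as a crude relevance baseline.
CRISIS_KEYWORDS = {"flood", "earthquake", "wildfire", "evacuation"}
BBOX = (-123.2, 37.2, -121.5, 38.3)  # (min_lon, min_lat, max_lon, max_lat)

def is_candidate(tweet: dict) -> bool:
    """Keep tweets that mention a crisis keyword and fall inside the bounding box."""
    text_hit = any(word in tweet["text"].lower() for word in CRISIS_KEYWORDS)
    lon, lat = tweet.get("coordinates", (None, None))
    geo_hit = (lon is not None
               and BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3])
    return text_hit and geo_hit

tweets = [{"text": "Flood water rising near the bridge", "coordinates": (-122.4, 37.8)},
          {"text": "Great coffee this morning", "coordinates": (-122.4, 37.8)}]
print([t["text"] for t in tweets if is_candidate(t)])
```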

18 citations


Cites background or methods from "Vectorization of Text Documents for..."


  • ...In addition to their usage as classification inputs, embeddings can also be used in other ways, such as keyword or descriptive word expansion (Viegas et al., 2019; Qiang et al., 2019), clustering (Hadifar et al., 2019; Comito et al., 2019), queries, or summarization (Singh and Shashi, 2019)....



  • ...…(Ester et al., 1996), spatiotemporal (Birant and Kut, 2007; Lee et al., 2017), and content-based features (Mendonça et al., 2019; Comito et al., 2019; Singh and Shashi, 2019; Fedoryszak et al., 2019) as well as combinations of these (Nguyen and Shin, 2017; Zhang and Eick, 2019) are available....


Proceedings ArticleDOI
02 Jul 2020
TL;DR: The objective here is to automatically scrape news from English news websites and identify disaster relevant news using natural language processing techniques and machine learning concepts, which can further be dynamically displayed on the crisis management websites.
Abstract: We are living in unprecedented times, and anyone in this world could be impacted by natural disasters in some way or another. Life is unpredictable and what is to come is unforeseeable; nobody knows what the very next moment will hold, and it could be a disastrous one. The past cannot be changed, but it can act constructively towards the betterment of the current situation: 'Precaution is better than cure'. To address this uncertainty in life-and-death situations, automated identification of disaster news for crisis management is proposed using machine learning and natural language processing. It is a software solution that can help disaster management websites dynamically show disaster-relevant news, which can then be shared to other social media handles through their sites. The objective is to automatically scrape news from English news websites and identify disaster-relevant news using natural language processing techniques and machine learning concepts, which can then be dynamically displayed on crisis management websites. The complete model is automated and requires no manual labor at all. The architecture is based on machine learning principles: news scraped from top news websites with a spider-scraper is classified into two categories, disaster-relevant news and disaster-irrelevant news, and the relevant disaster news is eventually displayed on the crisis management website.
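A hedged sketch of the scrape-then-classify idea: headlines are pulled from an HTML snippet with BeautifulSoup and labeled by a small TF-IDF plus naive Bayes model; the HTML, training examples, and classifier choice are placeholders, not the paper's architecture.

```python
# Extract headlines from HTML, then classify them as disaster-relevant or not.
from bs4 import BeautifulSoup
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

html = """<html><body>
<h2>Cyclone forces thousands to evacuate coastal towns</h2>
<h2>Football team wins the league title</h2>
</body></html>"""
headlines = [h.get_text(strip=True)
             for h in BeautifulSoup(html, "html.parser").find_all("h2")]

train_x = ["Earthquake damages hundreds of homes", "Floods cut off mountain villages",
           "New smartphone released this week", "Band announces world tour"]
train_y = ["disaster", "disaster", "other", "other"]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(train_x, train_y)
for headline, label in zip(headlines, clf.predict(headlines)):
    print(label, "-", headline)
```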

14 citations


Cites methods from "Vectorization of Text Documents for..."

  • ...5) Text Vectorization: In this step, the text data is converted into numerical form [14]....


Journal ArticleDOI
TL;DR: This study proposes an architecture that combines sentiment analysis and community detection to obtain an overall sentiment of related topics and applies the model to the following topics: shopping, politics, COVID-19, and electric vehicles, to understand emerging trends and issues and their possible marketing, business, and political implications.
Abstract: Microblogging has taken a considerable upturn in recent years; with the growth of microblogging websites like Twitter, people have started to share more of their opinions about various pressing issues on such online social networks. A broader understanding of the domain in question is required to make an informed decision. With this motivation, our study focuses on finding the overall sentiment of topics related to a given topic. We propose an architecture that combines sentiment analysis and community detection to obtain an overall sentiment of related topics. We apply the model to the following topics: shopping, politics, COVID-19, and electric vehicles, to understand emerging trends and issues and their possible marketing, business, and political implications.
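A rough sketch of coupling community detection with a sentiment score, using networkx's greedy modularity communities and a tiny hand-made lexicon; the graph, posts, and lexicon are illustrative only and do not reproduce the cited model.

```python
# Detect communities in a user graph, then aggregate a simple sentiment per community.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

LEXICON = {"great": 1, "love": 1, "cheap": 1, "terrible": -1, "slow": -1}

def sentiment(text: str) -> int:
    return sum(LEXICON.get(tok, 0) for tok in text.lower().split())

# Nodes are users, edges are retweet/mention links, posts are their texts.
G = nx.Graph([("a", "b"), ("b", "c"), ("d", "e"), ("e", "f")])
posts = {"a": "love the new electric vehicle", "b": "charging is great",
         "c": "range is terrible", "d": "shopping deals are cheap",
         "e": "delivery was slow", "f": "love the discounts"}

for community in greedy_modularity_communities(G):
    score = sum(sentiment(posts[user]) for user in community)
    print(sorted(community), "overall sentiment:", score)
```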

12 citations

References
Proceedings ArticleDOI
01 Oct 2014
TL;DR: A new global log-bilinear regression model that combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
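A small sketch of the word-analogy arithmetic such vectors support, assuming gensim's downloader and its pre-trained "glove-wiki-gigaword-100" model are available; the model name comes from gensim-data, not from the paper.

```python
# Load pre-trained GloVe vectors and test analogy / similarity behaviour.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")   # downloads pre-trained GloVe vectors
# king - man + woman ~= queen, the classic analogy test
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
print(glove.similarity("ice", "steam"))
```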

30,558 citations


"Vectorization of Text Documents for..." refers background in this paper

  • ...The GloVe [2] scores represent the frequency of co-occurrence of a word with other words....


  • ...GloVe: Is a count-based model which constructs a global co-occurrence matrix where each row of the matrix is a word while each column represents the contexts in which the word can appear....


  • ...Pennington et al. [2] in 2014, released GloVe, a competitive set of pre-trained word embeddings without using neural networks, signalling that word embeddings had reached the mainstream....



  • ...GloVe learns its vectors after calculating the co-occurrences using dimensionality reduction....


Proceedings ArticleDOI
11 Oct 2018
TL;DR: BERT, as described in this paper, pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; the pre-trained model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5 (7.7 point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).
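A minimal sketch of using pre-trained BERT with one additional classification layer, assuming the Hugging Face transformers library; the checkpoint name, label count, and example sentence are illustrative, and the classification head shown here is randomly initialized until it is fine-tuned on labeled data.

```python
# Pre-trained BERT encoder plus a single classification output layer.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The storm caused severe flooding downtown.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # scores from the single output layer on top of BERT
print(logits.softmax(dim=-1))         # class probabilities (head untrained, so near-random)
```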

24,672 citations

01 Jan 1967
TL;DR: The k-means algorithm described in this paper partitions an N-dimensional population into k sets on the basis of a sample. The k-means concept is a generalization of the ordinary sample mean, and the procedure is shown to give partitions that are reasonably efficient in the sense of within-class variance.
Abstract: The main purpose of this paper is to describe a process for partitioning an N-dimensional population into k sets on the basis of a sample. The process, which is called 'k-means,' appears to give partitions which are reasonably efficient in the sense of within-class variance. That is, if p is the probability mass function for the population, S = {S_1, S_2, ..., S_k} is a partition of E_N, and u_i, i = 1, 2, ..., k, is the conditional mean of p over the set S_i, then W^2(S) = sum_{i=1}^{k} ∫_{S_i} |z − u_i|^2 dp(z) tends to be low for the partitions S generated by the method. We say 'tends to be low,' primarily because of intuitive considerations, corroborated to some extent by mathematical analysis and practical computational experience. Also, the k-means procedure is easily programmed and is computationally economical, so that it is feasible to process very large samples on a digital computer. Possible applications include methods for similarity grouping, nonlinear prediction, approximating multivariate distributions, and nonparametric tests for independence among several variables. In addition to suggesting practical classification methods, the study of k-means has proved to be theoretically interesting. The k-means concept represents a generalization of the ordinary sample mean, and one is naturally led to study the pertinent asymptotic behavior, the object being to establish some sort of law of large numbers for the k-means. This problem is sufficiently interesting, in fact, for us to devote a good portion of this paper to it. The k-means are defined in section 2.1, and the main results which have been obtained on the asymptotic behavior are given there. The rest of section 2 is devoted to the proofs of these results. Section 3 describes several specific possible applications, and reports some preliminary results from computer experiments conducted to explore the possibilities inherent in the k-means idea. The extension to general metric spaces is indicated briefly in section 4. The original point of departure for the work described here was a series of problems in optimal classification (MacQueen [9]) which represented special
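A small numerical sketch of the within-class variance W^2(S) defined above, using scikit-learn's KMeans on made-up sample points; for a finite sample, KMeans's inertia_ equals this sum of squared distances to the cluster means.

```python
# Compare KMeans's inertia_ with the hand-computed within-class variance.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Recompute sum_i sum_{z in S_i} |z - u_i|^2 and compare with inertia_.
w2 = sum(np.sum((points[km.labels_ == i] - centre) ** 2)
         for i, centre in enumerate(km.cluster_centers_))
print(km.inertia_, w2)   # the two values agree
```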

24,320 citations


"Vectorization of Text Documents for..." refers methods in this paper

  • ...The unifiable news article identification phase of the proposed framework studies TF-IDF, Word2Vec and Doc2Vec vectorization methods in detail and clusters the articles using the k-means clustering [12]....


Posted Content
TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.
Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
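A minimal sketch of training such word vectors with gensim's Word2Vec implementation; the toy corpus and hyperparameters are placeholders, not the cited paper's setup.

```python
# Train skip-gram word vectors on a tiny tokenized corpus with gensim.
from gensim.models import Word2Vec

sentences = [["news", "articles", "cover", "trending", "topics"],
             ["trending", "topics", "spread", "on", "social", "media"],
             ["articles", "summarize", "news", "events"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)
print(model.wv["news"].shape)                 # a 50-dimensional word vector
print(model.wv.most_similar("news", topn=2))  # nearest neighbours in the vector space
```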

20,077 citations

Proceedings Article
Quoc V. Le1, Tomas Mikolov1
21 Jun 2014
TL;DR: Paragraph Vector is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents, and its construction gives the algorithm the potential to overcome the weaknesses of bag-of-words models.
Abstract: Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperforms bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
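A minimal sketch of the Paragraph Vector idea via gensim's Doc2Vec, where dm=1 selects the distributed-memory variant; the documents and parameters are illustrative only.

```python
# Learn fixed-length document vectors and infer one for an unseen text.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [TaggedDocument(words=["markets", "rally", "after", "rate", "cut"], tags=[0]),
          TaggedDocument(words=["storm", "causes", "flooding", "in", "the", "city"], tags=[1])]

model = Doc2Vec(corpus, vector_size=20, min_count=1, dm=1, epochs=100)

vec = model.infer_vector(["flooding", "hits", "the", "city"])
print(vec.shape)                              # a fixed-length (20-d) vector for an unseen text
print(model.dv.most_similar([vec], topn=1))   # closest training document
```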

7,119 citations


"Vectorization of Text Documents for..." refers methods or result in this paper

  • ...Doc2Vec: Doc2Vec [5] is an extension of Word2Vec or rather SentenceToVec as sentences are a part of documents, and the procedure of obtaining the Doc2Vec embeddings is similar to that of SentenceToVec....


  • ...The Doc2Vec [5] model architecture also has two underlying algorithms the distributed memory paragraph vectors(dmpv) as shown in figure 3 and the distributed bag of words (dbow) shown in figure 4....


Trending Questions (1)
What is the purpose of text vectorization?

The purpose of text vectorization is to convert textual data into numerical representations that can be understood and processed by machines.
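A tiny illustration of that idea with scikit-learn's CountVectorizer, where each sentence becomes a numerical vector of term counts (the sentences are placeholders):

```python
# Turn two sentences into count vectors over the shared vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(["the cat sat on the mat", "the dog sat"])
print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # one count vector per sentence
```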