Author

Sushila Palwe

Bio: Sushila Palwe is an academic researcher. The author has contributed to research in topics: Topic model & Newspaper. The author has an h-index of 1 and has co-authored 1 publication receiving 1 citation.

Papers
Book ChapterDOI
01 Jan 2018
TL;DR: A word co-occurrence network-based topic model named WNTM is presented, which handles both long and short news texts by overcoming the sparsity issues that hamper earlier approaches; the authors further intend to build a news recommendation system that recommends news according to user preference.
Abstract: News media includes print media, broadcast news, and the Internet (online newspapers, news blogs, etc.). The proposed system intends to collect news data from such diverse sources, capture the varied perceptions, summarize, and present the news. It involves identifying topics from real-time news extractions and then clustering the news documents based on those topics. Previous approaches, like LDA, identify topics efficiently for long news texts but fail to do so for short news texts, where acute sparsity and irregularity are prevalent. In this paper, we present a solution for topic modeling, i.e., a word co-occurrence network-based model named WNTM, which works for both long and short news by overcoming these shortcomings. It works effectively with modest time and space complexity. Further, we intend to create a news recommendation system that recommends news to the user according to user preference.
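The abstract does not spell out the mechanics, but the published WNTM recipe addresses short-text sparsity by modeling word adjacency rather than raw documents. The sketch below is a minimal illustration under that assumption: build a word co-occurrence network with a sliding window, treat each word's neighbour list as a pseudo-document, and run ordinary LDA over the pseudo-documents. The window size, toy corpus, and function names are illustrative, not taken from the paper.

```python
# A minimal WNTM-style sketch (assumed recipe, not the paper's code):
# sliding-window co-occurrence -> per-word pseudo-documents -> LDA.
from collections import defaultdict
from gensim import corpora, models

def word_pseudo_docs(docs, window=3):
    """Collect, for each word, the words co-occurring within `window`."""
    neighbours = defaultdict(list)
    for doc in docs:
        tokens = doc.split()
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            neighbours[w].extend(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return list(neighbours.values())

docs = ["stock market falls on rate fears",
        "market rally lifts stock prices"]
pseudo_docs = word_pseudo_docs(docs)

# Standard LDA, but over per-word pseudo-documents instead of raw docs,
# so even very short texts contribute dense co-occurrence statistics.
dictionary = corpora.Dictionary(pseudo_docs)
corpus = [dictionary.doc2bow(pd) for pd in pseudo_docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())
```

Because every word contributes one pseudo-document regardless of document length, short news items no longer starve the model of co-occurrence evidence, which is what lets the same machinery cover long and short texts.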

6 citations


Cited by
Journal ArticleDOI
TL;DR: The proposed k-means topic modeling (KTM) approach is applicable to classification and clustering tasks in text mining and achieves higher performance than its competitors LDA and LSA.
Abstract: Topic modeling is an effective text mining and information retrieval approach to organizing knowledge with various contents under a specific topic. Text documents in the form of news articles are increasing very fast on the web. Analysis of these documents is very important in the fields of text mining and information retrieval. Extracting meaningful information from these documents is a challenging task. One approach for discovering the theme of text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics, and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach using the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters-21578 and BBC News, show that KTM performs better than state-of-the-art topic models like LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining and achieves higher performance than its competitors LDA and LSA.
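As a rough illustration of how a k-means-based topic model can work, the sketch below clusters TF-IDF vectors with k-means and reads the top-weighted terms of each cluster centroid as that cluster's topic. This is a plausible reconstruction of the general idea only; the paper's actual KTM pipeline may differ in its features and post-processing.

```python
# Hedged sketch of a k-means topic model in the spirit of KTM:
# cluster TF-IDF vectors, then read topics off the cluster centroids.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the central bank raised interest rates",
    "inflation and interest rates worry markets",
    "the team won the championship final",
    "players celebrated the cup final victory",
]

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for k, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:5]   # highest-weight terms in the centroid
    print(f"topic {k}:", [terms[i] for i in top])
```

One appeal of this clustering view is that the topic-word association is deterministic (read directly off the centroid), whereas LDA's sampled assignments can be noisy on small corpora.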

11 citations

Proceedings ArticleDOI
15 Dec 2022
TL;DR: In this article, a statistical text summarization technique is applied to Gujarati text, one of the resource-poor South Asian languages, using TF-IDF, LSA, and LDA methods on a custom dataset.
Abstract: Automatic text summarization is an essential part of natural language processing (NLP), a subpart of the Artificial Intelligence domain. The widespread use of text summarization is due to the massive use of the internet in every aspect of life. In this research article, we perform a statistical text summarization technique on Gujarati text, which is one of the resource-poor South Asian languages. We have applied TF-IDF, LSA, and LDA methods to our custom dataset. We evaluated our summaries using ROUGE scores at 10%, 20%, and 30% compression ratios. We have used ROUGE-1, ROUGE-2, ROUGE-W, and ROUGE-L to measure accuracy, and LDA achieves the highest ROUGE score among the methods. All results are presented in table format with the individual ROUGE scores and the average ROUGE score of each method. This article aims to analyze the performance of unsupervised methods for automatic text summarization of the Gujarati language without any pre-processing technique. In concept-based methods, sentences are selected using outside information [4], [5]. The topic-based idea covers matching the title against the main text: when a sentence's words match the title words, the sentence receives a high score; otherwise, the sentence is not included in the summary [6], [7]. The cluster-based method organises comparable sentences by topic; in this procedure, the cluster count must be specified [8]–[11]. The graph-based method is founded on the similarity notion: it compares the similarity of all the words and uses those results to determine the best sentences. Numerous studies on graph-based approaches have been conducted [12]–[16].
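Of the three methods compared, the TF-IDF variant is the simplest to illustrate. The sketch below is an assumption-laden stand-in rather than the authors' code: it scores each sentence by the mean TF-IDF weight of its terms and keeps the top fraction given by the compression ratio. It applies to Gujarati or any other language once the text is split into sentences.

```python
# Illustrative TF-IDF extractive summarizer with a compression ratio
# (a sketch of the general technique, not the paper's implementation).
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_summarize(sentences, ratio=0.3):
    X = TfidfVectorizer().fit_transform(sentences)
    scores = X.mean(axis=1).A1                      # mean TF-IDF weight per sentence
    n_keep = max(1, int(len(sentences) * ratio))    # compression ratio -> sentence count
    keep = sorted(scores.argsort()[::-1][:n_keep])  # restore original sentence order
    return [sentences[i] for i in keep]

sents = ["Sentence one about the main topic.",
         "A filler sentence.",
         "Another key sentence about the main topic."]
print(tfidf_summarize(sents, ratio=0.3))
```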

1 citation


Proceedings ArticleDOI
15 Dec 2022
TL;DR: In this article, a mathematical model based on the secretary problem, which falls under the extractive text summarization method, was proposed to generate summaries, though the resulting summary may not include some important sentences.
Abstract: Automatic text summarization is a long-running research problem and a crucial part of NLP (natural language processing), which is in turn a subpart of Artificial Intelligence (AI). Many successful research techniques have been invented for summarization purposes. We propose a unique concept for generating the summary using the secretary problem, which comes under the extractive text summarization method. We divide the document text into two parts. In the first part, we match the document text against the main title: if a sentence contains the main title's words, we keep that sentence in one list. We then apply the secretary problem to the remaining sentences, which do not contain title words. Combined with the other sentence selection method, the secretary problem guarantees picking the best candidate roughly one-third of the time, or about 37%. This article presents our concept of leveraging a mathematical model to generate a summary that may not include some important sentences.
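The classic secretary rule rejects the first n/e candidates outright and then accepts the first candidate that beats everything seen so far, which is where the roughly 37% (1/e) success guarantee comes from. The sketch below is a speculative illustration of applying that rule to sentence selection; the scoring function (sentence length here) and all names are placeholders, not the paper's method.

```python
# Speculative sketch: the 37% secretary rule applied to picking a
# sentence. Scoring by length is a stand-in for a real salience score.
import math

def secretary_pick(sentences, score=len):
    n = len(sentences)
    cutoff = max(1, int(n / math.e))          # observe ~37% of candidates, pick none
    best_seen = max(score(s) for s in sentences[:cutoff])
    for s in sentences[cutoff:]:
        if score(s) > best_seen:              # first candidate beating the sample
            return s
    return max(sentences, key=score)          # fallback: best overall

sents = ["Short.",
         "A medium length sentence here.",
         "A considerably longer and more informative sentence."]
print(secretary_pick(sents))
```
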
Journal ArticleDOI
TL;DR: In this paper, a centroid capturing the lexical pattern of the sentences on the reader's subtopics of interest is extracted from the words frequently used in them and is then used as a query in a vector space model (VSM) for sentence classification and extraction.
Abstract: COVID-19 news covers subtopics like infections, deaths, the economy, jobs, and more. The proposed method generates a news summary based on the subtopics of a reader's interest. It extracts a centroid having the lexical pattern of the sentences on those subtopics from the words frequently used in them. The centroid is then used as a query in the vector space model (VSM) for sentence classification and extraction, producing a query-focused summarization (QFS) of the documents. Three approaches, TF-IDF, word vector averaging, and an auto-encoder, are experimented with to generate the sentence embeddings used in the VSM. These embeddings are ranked by their similarity to the query embedding. A novel approach is introduced to find the value of the similarity parameter using a supervised technique to classify the sentences. Finally, the performance of the method has been assessed in two ways: in the first assessment, all the sentences of the dataset are considered together; in the second, each document-wise group of sentences is considered separately using fivefold cross-validation. The proposed method achieves mean F1 scores from a minimum of 0.60 to a maximum of 0.63 with the three sentence encoding approaches on the test dataset.
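A minimal sketch of the query-focused ranking step follows, under the assumption that the method reduces to: form a centroid query from frequent subtopic words, embed the sentences (TF-IDF shown here, one of the three encodings mentioned), and rank sentences by cosine similarity to the query in the VSM. The similarity threshold and all data below are illustrative, not the paper's.

```python
# Sketch of centroid-as-query ranking in a VSM (TF-IDF encoding).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "New infections rose sharply this week.",
    "The stock market recovered some losses.",
    "Hospitals report more COVID-19 deaths.",
]
query = "infections deaths cases hospital"   # centroid built from subtopic words

vec = TfidfVectorizer().fit(sentences + [query])
S = vec.transform(sentences)                 # sentence embeddings
q = vec.transform([query])                   # query embedding

sims = cosine_similarity(S, q).ravel()
threshold = 0.1                              # similarity cut-off (assumed value)
ranked = sims.argsort()[::-1]
summary = [sentences[i] for i in ranked if sims[i] >= threshold]
print(summary)
```

In the paper the cut-off is not hand-picked as above: the abstract states that the similarity parameter is learned with a supervised classification technique.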