scispace - formally typeset
Open AccessJournal ArticleDOI

Bengali text document categorization based on very deep convolution neural network

Reads0
Chats0
TLDR
The proposed intelligent text classification model comprises GloVe embedding and Very Deep Convolution Neural Network (VDCNN) classifier, and the Embedding Parameters Identification (EPI) Algorithm, which selects the best embedding parameters for low-resource languages (including Bengali).
Abstract
In recent years, the amount of digital text contents or documents in the Bengali language has increased enormously on online platforms due to the effortless access of the Internet via electronic gadgets. As a result, an enormous amount of unstructured data is created that demands much time and effort to organize, search or manipulate. To manage such a massive number of documents effectively, an intelligent text document classification system is proposed in this paper. Intelligent classification of text document in a resource-constrained language (like Bengali) is challenging due to unavailability of linguistic resources, intelligent NLP tools, and larger text corpora. Moreover, Bengali texts are available in two morphological variants (i.e., Sadhu-bhasha and Cholito-bhasha) making the classification task more complicated. The proposed intelligent text classification model comprises GloVe embedding and Very Deep Convolution Neural Network (VDCNN) classifier. Due to the unavailability of standard corpus, this work develops a large Embedding Corpus (EC) containing 969 , 000 unlabelled texts and Bengali Text Classification Corpus (BDTC) containing 156 , 207 labelled documents arranged into 13 categories. Moreover, this work proposes the Embedding Parameters Identification (EPI) Algorithm, which selects the best embedding parameters for low-resource languages (including Bengali). Evaluation of 165 embedding models with intrinsic evaluators (semantic & syntactic similarity measures) shows that the GloVe model is more suitable (regarding Spearman & Pearson correlation) than other embeddings (Word2Vec, FastText, m-BERT) in Bengali text. Experimental results on the test dataset confirm that the proposed GloVe + VDCNN model outperformed (achieving the highest 96.96 % accuracy) the other classification models and existing methods to perform the Bengali text classification task.

read more

Citations
More filters
Proceedings ArticleDOI

Research on Dual Channel News Headline Classification Based on ERNIE Pre-training Model

Junjie Li, +1 more
TL;DR: The proposed dual-channel network model DC-EBAD based on the ERNIE pre-training model improves the accuracy, precision and F1-score of news headline classification compared with the traditional neural network model and the single-channel model under the same conditions.
Journal ArticleDOI

CovTiNet: Covid text identification network using attention-based positional embedding feature fusion

TL;DR: In this article , the authors proposed a deep learning-based network (CovTiNet) to identify Covid text in Bengali, which incorporates an attention-based position embedding feature fusion for text-to-feature representation.
References
More filters
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won the 1st place on the ILSVRC 2015 classification task.
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal ArticleDOI

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems theoretical convergence multiclass classification probability estimates and parameter selection are discussed in detail.
Journal ArticleDOI

A Coefficient of agreement for nominal Scales

TL;DR: In this article, the authors present a procedure for having two or more judges independently categorize a sample of units and determine the degree, significance, and significance of the units. But they do not discuss the extent to which these judgments are reproducible, i.e., reliable.
Proceedings ArticleDOI

Glove: Global Vectors for Word Representation

TL;DR: A new global logbilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods and produces a vector space with meaningful substructure.
Related Papers (5)