Bengali text document categorization based on very deep convolution neural network

doi:10.1016/J.ESWA.2021.115394

Open Access · Journal Article

Md. Rajib Hossain, +3 more

02 Jul 2021

Expert Systems With Applications, Vol. 184, Article 115394

TLDR

The proposed intelligent text classification model combines GloVe embedding with a Very Deep Convolution Neural Network (VDCNN) classifier. The work also proposes the Embedding Parameters Identification (EPI) Algorithm, which selects the best embedding parameters for low-resource languages (including Bengali).

Abstract:

In recent years, the amount of digital text content in the Bengali language on online platforms has increased enormously, owing to effortless Internet access via electronic gadgets. As a result, an enormous amount of unstructured data is created that demands much time and effort to organize, search, or manipulate. To manage such a massive number of documents effectively, this paper proposes an intelligent text document classification system. Intelligent classification of text documents in a resource-constrained language (like Bengali) is challenging due to the unavailability of linguistic resources, intelligent NLP tools, and large text corpora. Moreover, Bengali texts are available in two morphological variants (i.e., Sadhu-bhasha and Cholito-bhasha), making the classification task more complicated. The proposed intelligent text classification model comprises GloVe embedding and a Very Deep Convolution Neural Network (VDCNN) classifier. Due to the unavailability of a standard corpus, this work develops a large Embedding Corpus (EC) containing 969,000 unlabelled texts and a Bengali Text Classification Corpus (BDTC) containing 156,207 labelled documents arranged into 13 categories. Moreover, this work proposes the Embedding Parameters Identification (EPI) Algorithm, which selects the best embedding parameters for low-resource languages (including Bengali). Evaluation of 165 embedding models with intrinsic evaluators (semantic and syntactic similarity measures) shows that the GloVe model is more suitable (in terms of Spearman and Pearson correlation) than other embeddings (Word2Vec, FastText, m-BERT) for Bengali text. Experimental results on the test dataset confirm that the proposed GloVe + VDCNN model outperforms the other classification models and existing methods on the Bengali text classification task, achieving the highest accuracy of 96.96%.
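The intrinsic evaluation mentioned in the abstract, which compares embedding models by Spearman and Pearson correlation, can be illustrated with a minimal sketch. This is not the authors' code and not the EPI Algorithm. It assumes the common word-similarity setup in which the cosine similarity of two word vectors is correlated with a human-assigned score for the same word pair. The function names, the toy vectors, and the gold scores below are all hypothetical placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def evaluate(embeddings, gold_pairs):
    """Return (spearman, pearson) between model cosine scores and gold scores.

    Pairs with an out-of-vocabulary word are skipped (an assumption of this sketch).
    """
    model_scores, gold_scores = [], []
    for word_a, word_b, score in gold_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            gold_scores.append(score)
    return spearman(model_scores, gold_scores), pearson(model_scores, gold_scores)


# Hypothetical toy data: placeholder tokens, not taken from the paper's corpora.
toy_embeddings = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
    "w4": [0.5, 0.5],
}
toy_gold = [
    ("w1", "w2", 9.0),
    ("w1", "w4", 5.0),
    ("w1", "w3", 1.0),
    ("w1", "unseen", 3.0),  # skipped: "unseen" has no vector
]

rho, r = evaluate(toy_embeddings, toy_gold)
print(f"Spearman: {rho:.3f}, Pearson: {r:.3f}")
```

Under this setup, a higher correlation means the embedding's similarity ordering agrees more closely with human judgments. Running the same evaluation over each candidate embedding model and keeping the best-scoring one is one plausible way to compare models such as GloVe, Word2Vec, and FastText.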

Citations

Research on Dual Channel News Headline Classification Based on ERNIE Pre-training Model

SnTiEmd: Sentiment Specific Embedding Model Generation and Evaluation for a Resource Constraint Language

CovTiNet: Covid text identification network using attention-based positional embedding feature fusion

CovTexMiner: Covid Text Mining Using CNN with Domain-Specific GloVe Embedding

A dictionary based model for Bengali document classification
