
Showing papers on "Word embedding published in 2022"


Journal ArticleDOI
TL;DR: In this article, a Hierarchical Deep Word Embedding (HDWE) model integrating sparse constraints and an improved ReLU operator is proposed to address click feature prediction from visual features.
Abstract: The click feature of an image, defined as the user click frequency vector of the image on a predefined word vocabulary, is known to effectively reduce the semantic gap for fine-grained image recognition. Unfortunately, user click frequency data are usually absent in practice. It remains challenging to predict the click feature from the visual feature, because the user click frequency vector of an image is always noisy and sparse. In this paper, we devise a Hierarchical Deep Word Embedding (HDWE) model by integrating sparse constraints and an improved ReLU operator to address click feature prediction from visual features. HDWE is a coarse-to-fine click feature predictor that is learned with the help of an auxiliary image dataset containing click information. It can therefore discover the hierarchy of word semantics. We evaluate HDWE on three dog and one bird image datasets, in which Clickture-Dog and Clickture-Bird are utilized as auxiliary datasets to provide click data, respectively. Our empirical studies show that HDWE has 1) higher recognition accuracy, 2) a larger compression ratio, and 3) good one-shot learning ability and scalability to unseen categories.
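A minimal sketch of the sparsity-constrained prediction idea in Keras, assuming hypothetical feature and vocabulary sizes; the coarse-to-fine hierarchy and auxiliary click dataset of HDWE are omitted, so this is an illustration, not the authors' implementation.

```python
# Sketch: predict a sparse click-frequency vector from a visual feature.
# `visual_dim` and `vocab_size` are assumed placeholders, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

visual_dim, vocab_size = 2048, 1000

model = tf.keras.Sequential([
    tf.keras.Input(shape=(visual_dim,)),
    layers.Dense(512, activation="relu"),
    # L1 activity regularization encourages sparse predicted click vectors,
    # echoing the sparse constraints described in the abstract.
    layers.Dense(vocab_size, activation="relu",
                 activity_regularizer=regularizers.l1(1e-5)),
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```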

102 citations


Journal ArticleDOI
TL;DR: In this article, an Automated Word Embedding with Parameter Tuned Deep Learning (AWE-PTDL) model is proposed for focused web crawling, involving several processes: pre-processing, an incremental skip-gram model with negative sampling (ISGNS), bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning.
Abstract: In recent years, web crawling has gained significant attention due to the drastic advancements in the World Wide Web. Web search engines face the issue of retrieving massive quantities of web documents. One such crawler is the focused crawler, which intends to selectively gather web pages from the Internet. However, the efficiency of focused crawling can easily be affected by the environment of the web pages. In this view, this paper presents an Automated Word Embedding with Parameter Tuned Deep Learning (AWE-PTDL) model for focused web crawling. The proposed model involves different processes, namely pre-processing, Incremental Skip-gram Model with Negative Sampling (ISGNS)-based word embedding, bidirectional long short-term memory-based classification, and bird swarm optimization-based hyperparameter tuning. Standard SGNS training requires a pass over the complete training data to pre-compute the noise distribution before performing Stochastic Gradient Descent (SGD), and the ISGNS technique is derived for the word-embedding process. Besides, the cosine similarity is computed from the word-embedding matrix to generate a feature vector that is fed into the Bidirectional Long Short-Term Memory (BiLSTM) for the prediction of website relevance. Finally, the Bird Swarm Optimization-Bidirectional Long Short-Term Memory (BSO-BiLSTM) classification model is used to classify the webpages, and the BSO algorithm is employed to determine the hyperparameters of the BiLSTM model so that the overall crawling performance can be considerably enhanced. For validating the enhanced outcome of the presented model, a comprehensive set of simulations is carried out and the results are examined in terms of different measures. The AWE-PTDL technique attains a higher harvest rate of 85% compared with the other techniques. The experimental results highlight the enhanced web crawling performance of the proposed model over recent state-of-the-art web crawlers.
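A minimal sketch of the skip-gram-with-negative-sampling embedding and cosine-similarity feature steps using gensim; the incremental (ISGNS) variant and the BSO-tuned BiLSTM classifier are not shown, and the toy corpus and topic keyword are hypothetical.

```python
from gensim.models import Word2Vec
import numpy as np

# Tokenized web-page texts (hypothetical placeholders).
pages = [["solar", "panel", "efficiency"], ["football", "league", "scores"]]

# sg=1 selects skip-gram; negative=5 enables negative sampling (SGNS).
model = Word2Vec(pages, vector_size=100, window=5, sg=1, negative=5, min_count=1)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

topic = model.wv["solar"]                        # crawl-topic keyword vector
features = [cosine(model.wv[w], topic) for w in pages[0]]  # relevance features
```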

63 citations


Journal ArticleDOI
TL;DR: In this paper, the authors proposed a heuristic method to build a recommender engine in an IoT environment exploiting swarm intelligence techniques, where smart objects are represented using real-valued vectors obtained through the Doc2Vec model, a word embedding technique able to capture the semantic context, representing documents and sentences with dense vectors.
Abstract: In smart environments, traditional information management approaches are often unsuitable for the required processing due to the number and high dynamicity of the entities involved. Smart objects (enhanced devices or IoT services belonging to a smart system) interact and maintain relations, which require effective and efficient selection/filtering mechanisms to better meet users' requirements. Recommender systems provide useful and customized information, properly selected and filtered, for users and services. This paper proposes a heuristic method to build a recommender engine in an IoT environment exploiting swarm intelligence techniques. Smart objects are represented using real-valued vectors obtained through the Doc2Vec model, a word embedding technique able to capture the semantic context, representing documents and sentences with dense vectors. The vectors are associated with mobile agents that move in a virtual 2D space following a bio-inspired model - the flocking model - in which agents perform simple, local operations autonomously to obtain a global intelligent organization. A similarity rule, based on the assigned vectors, was designed to enable agents to discriminate among themselves. A closer positioning (clustering) of only similar agents is achieved. The intelligent positioning makes it easy to identify similar smart objects, thus enabling fast and effective selection operations. Experimental evaluations demonstrate the validity of the approach and show that the proposed methodology improves performance by about 50%, in terms of clustering quality and relevance, compared to other existing approaches.
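A minimal sketch of the Doc2Vec representation step with gensim; the flocking-based agent movement and 2D positioning are not shown, and the object descriptions are hypothetical.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Textual descriptions of smart objects (hypothetical placeholders).
descriptions = [
    TaggedDocument(words=["temperature", "sensor", "living", "room"], tags=["obj0"]),
    TaggedDocument(words=["smart", "light", "bedroom"], tags=["obj1"]),
]
model = Doc2Vec(descriptions, vector_size=64, epochs=40, min_count=1)

# Each agent carries its object's dense vector; similar objects end up with
# nearby vectors, which drives the similarity rule for clustering.
vector = model.dv["obj0"]
print(model.dv.most_similar("obj0"))
```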

33 citations


Proceedings ArticleDOI
14 Feb 2022
TL;DR: A thorough structural analysis is conducted aiming to provide an interpretation of pre-trained language models for source code from three distinctive perspectives: attention analysis, probing on the word embedding, and syntax tree induction.
Abstract: Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These models leverage masked pre-training and Transformers and have achieved promising results. However, there has been little progress regarding the interpretability of existing pre-trained code models. It is not clear why these models work and what feature correlations they can capture. In this paper, we conduct a thorough structural analysis aiming to provide an interpretation of pre-trained language models for source code (e.g., CodeBERT and GraphCodeBERT) from three distinctive perspectives: (1) attention analysis, (2) probing on the word embedding, and (3) syntax tree induction. Through comprehensive analysis, this paper reveals several insightful findings that may inspire future studies: (1) attention aligns strongly with the syntax structure of code; (2) pre-trained language models of code can preserve the syntax structure of code in the intermediate representations of each Transformer layer; and (3) pre-trained models of code have the ability to induce syntax trees of code. These findings suggest that it may be helpful to incorporate the syntax structure of code into the process of pre-training for better code representations.
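A minimal sketch of the attention-analysis perspective: extracting per-layer attention maps from CodeBERT with Hugging Face transformers; the probing and syntax-tree-induction analyses are not shown.

```python
from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base",
                                  output_attentions=True)

inputs = tok("def add(a, b): return a + b", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per
# layer; alignment between attention and code syntax is read off these maps.
print(len(out.attentions), out.attentions[0].shape)
```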

32 citations


Journal ArticleDOI
30 May 2022-Sensors
TL;DR: A state-of-the-art binary classification performance for Bangla sentiment analysis is shown that significantly outperforms all other embeddings and algorithms.
Abstract: The growth of the Internet has expanded the amount of data expressed by users across multiple platforms. The availability of these different worldviews and individuals' emotions empowers sentiment analysis. However, sentiment analysis becomes even more challenging due to a scarcity of standardized labeled data in the Bangla NLP domain. The majority of existing Bangla research has relied on deep learning models that focus on context-independent word embeddings, such as Word2Vec, GloVe, and fastText, in which each word has a fixed representation irrespective of its context. Meanwhile, context-based pre-trained language models such as BERT have recently revolutionized the state of natural language processing. In this work, we utilized BERT's transfer-learning ability with a deep integrated CNN-BiLSTM model for enhanced decision-making performance in sentiment analysis. In addition, we also applied transfer learning to classical machine learning algorithms for comparison with CNN-BiLSTM. Additionally, we explore various word embedding techniques, such as Word2Vec, GloVe, and fastText, and compare their performance to the BERT transfer-learning strategy. As a result, we show state-of-the-art binary classification performance for Bangla sentiment analysis that significantly outperforms all the other embeddings and algorithms.
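A minimal sketch of the CNN-BiLSTM classifier in Keras under assumed vocabulary and sequence sizes; in the BERT transfer-learning variant, BERT-derived features would replace the trainable Embedding layer.

```python
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len = 20000, 128  # assumed placeholders

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 128),
    layers.Conv1D(64, 5, activation="relu"),  # local n-gram features
    layers.MaxPooling1D(2),
    layers.Bidirectional(layers.LSTM(64)),    # bidirectional long-range context
    layers.Dense(1, activation="sigmoid"),    # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```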

32 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed a hybrid deep learning method that combines the strengths of sequence models and Transformer models while suppressing the limitations of sequence models, achieving state-of-the-art performance on the IMDb dataset, Twitter US Airline Sentiment dataset, and Sentiment140 dataset.
Abstract: Due to the rapid development of technology, social media has become more and more common in human daily life. Social media is a platform for people to express their feelings, feedback, and opinions. To understand the sentiment context of a text, sentiment analysis determines whether the sentiment is positive, negative, neutral, or any other personal feeling. Sentiment analysis is prominent from the perspective of business and politics, where it strongly impacts strategic decision-making. The challenges of sentiment analysis are attributable to the lexical diversity, imbalanced datasets, and long-distance dependencies of the texts. In view of this, a data augmentation technique with GloVe word embedding is leveraged to synthesize more lexically diverse samples by similar word vector replacements. The data augmentation also focuses on oversampling the minority classes to mitigate the imbalanced dataset problem. Apart from that, existing sentiment analysis mostly leverages sequence models to encode long-distance dependencies. Nevertheless, sequence models require a longer execution time as the processing is done sequentially. On the other hand, Transformer models require less computation time with parallelized processing. To that end, this paper proposes a hybrid deep learning method that combines the strengths of sequence models and Transformer models while suppressing the limitations of sequence models. Specifically, the proposed model integrates the Robustly Optimized BERT approach (RoBERTa) and Long Short-Term Memory for sentiment analysis. The Robustly Optimized BERT approach maps the words into a compact, meaningful word embedding space, while the Long Short-Term Memory model captures long-distance contextual semantics effectively. The experimental results demonstrate that the proposed hybrid model outshines the state-of-the-art methods by achieving F1-scores of 93%, 91%, and 90% on the IMDb dataset, Twitter US Airline Sentiment dataset, and Sentiment140 dataset, respectively.
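A minimal sketch of the RoBERTa-plus-LSTM hybrid in PyTorch: RoBERTa token representations feed an LSTM whose final state drives the classifier. The hidden sizes and three-class head are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class RobertaLSTM(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("roberta-base")
        self.lstm = nn.LSTM(input_size=768, hidden_size=256, batch_first=True)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        _, (h_n, _) = self.lstm(hidden)  # LSTM models long-distance context
        return self.fc(h_n[-1])

model = RobertaLSTM()
logits = model(torch.randint(0, 1000, (1, 16)),
               torch.ones(1, 16, dtype=torch.long))
```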

26 citations


Journal ArticleDOI
TL;DR: The results show that the SVM classifier and the Word2Vec CBOW (Continuous Bag of Words) model are more beneficial options for Roman Urdu sentiment analysis, while BERT word embedding, a two-layer LSTM, and SVM as the classifier are more suitable options for English-language sentiment analysis.
Abstract: Sentiment analysis (SA) has been an active research subject in the domain of natural language processing due to its important functions in interpreting people's perspectives and drawing successful opinion-based judgments. On social media, Roman Urdu is one of the most extensively utilized dialects. Sentiment analysis of Roman Urdu is difficult due to its morphological complexities and varied dialects. The purpose of this paper is to evaluate the performance of various word embeddings for Roman Urdu and English dialects using the CNN-LSTM architecture with traditional machine learning classifiers. We introduce a novel deep learning architecture for Roman Urdu and English dialect SA based on two layers: an LSTM for long-term dependency preservation and a one-layer CNN model for local feature extraction. To obtain the final classification, the feature maps learned by the CNN and LSTM are fed to several machine learning classifiers. Various word embedding models support this architecture. Extensive tests on four corpora show that the proposed model performs exceptionally well in Roman Urdu and English text sentiment classification, with accuracies of 0.904, 0.841, 0.740, and 0.748 on the MDPI, RUSA, RUSA-19, and UCL datasets, respectively. The results show that the SVM classifier and the Word2Vec CBOW (Continuous Bag of Words) model are more beneficial options for Roman Urdu sentiment analysis, while BERT word embedding, a two-layer LSTM, and SVM as the classifier are more suitable options for English-language sentiment analysis. The suggested model outperforms existing well-known advanced models on the relevant corpora, improving accuracy by up to 5%.
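A minimal sketch of the Word2Vec CBOW plus SVM combination favored for Roman Urdu; documents here are represented by averaged word vectors rather than the paper's CNN-LSTM feature maps, so treat it as an approximation with hypothetical data.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.svm import SVC

docs = [["bohat", "acha", "kaam"], ["bura", "tajurba", "tha"]]  # hypothetical
labels = [1, 0]

w2v = Word2Vec(docs, vector_size=100, sg=0, min_count=1)  # sg=0 selects CBOW

def doc_vector(tokens):
    # Average the vectors of in-vocabulary tokens into one document vector.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.stack([doc_vector(d) for d in docs])
clf = SVC(kernel="rbf").fit(X, labels)
```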

24 citations


Journal ArticleDOI
TL;DR: Deep Profile-based Bot detection framework (DeeProBot), as discussed by the authors, uses information from the user profile metadata of a Twitter account, such as the description, follower count, and tweet count, to classify Twitter accounts as either human or bot.
Abstract: Use of online social networks (OSNs) undoubtedly brings the world closer. OSNs like Twitter provide a space for expressing one's opinions on a public platform. This great potential is misused by the creation of bot accounts, which spread fake news and manipulate opinions. Hence, distinguishing genuine human accounts from bot accounts has become a pressing issue for researchers. In this paper, we propose a framework based on deep learning to classify Twitter accounts as either 'human' or 'bot.' We use information from the user profile metadata of the Twitter account, such as the description, follower count, and tweet count. We name the framework 'DeeProBot,' which stands for Deep Profile-based Bot detection framework. The raw text from the description field of the Twitter account is also considered a feature for training the model, embedded using pre-trained Global Vectors (GloVe) for word representation. Using only the user profile-based features considerably reduces the feature engineering overhead compared with that of user timeline-based features like user tweets and retweets. DeeProBot handles mixed types of features, including numerical, binary, and text data, making the model hybrid. The network is designed with long short-term memory (LSTM) units and dense layers to accept and process the mixed input types. The proposed model is evaluated on a collection of publicly available labeled datasets. We have designed the model to be generalizable across different datasets. The model is evaluated in two ways: testing on a hold-out set of the same dataset, and training on one dataset while testing on a different one. With these experiments, the proposed model achieved an AUC as high as 0.97 with a selected set of features.
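A minimal sketch of a mixed-input network in the spirit of DeeProBot: a text branch (embedding plus LSTM over the description) is concatenated with a dense branch over numeric profile features. Dimensions are assumptions; in the paper, pre-trained GloVe vectors would initialize the embedding.

```python
import tensorflow as tf
from tensorflow.keras import layers

text_in = tf.keras.Input(shape=(50,), name="description_tokens")
x = layers.Embedding(input_dim=20000, output_dim=100)(text_in)  # GloVe init
x = layers.LSTM(64)(x)

num_in = tf.keras.Input(shape=(6,), name="profile_numeric")  # e.g. counts
y = layers.Dense(16, activation="relu")(num_in)

merged = layers.Concatenate()([x, y])
out = layers.Dense(1, activation="sigmoid", name="bot_probability")(merged)

model = tf.keras.Model(inputs=[text_in, num_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```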

24 citations


Journal ArticleDOI
TL;DR: In this article, the authors proposed an offensive text classification algorithm named LSTM-BOOST, employing a Long Short-Term Memory (LSTM) model with ensemble learning to recognize offensive Bengali texts on various social media platforms.
Abstract: Recently, offensive content has become increasingly prevalent for harassing and criticizing people on numerous social media platforms. This paper proposes an offensive text classification algorithm named LSTM-BOOST, employing a Long Short-Term Memory (LSTM) model with ensemble learning to recognize offensive Bengali texts on various social media platforms. The proposed LSTM-BOOST model uses a modified AdaBoost algorithm employing principal component analysis (PCA) along with LSTM networks. In the LSTM-BOOST model, the dataset is divided into three categories, and PCA and LSTM networks are applied to each part of the dataset to capture the most significant variance and reduce the weighted error of the model's weak hypotheses. Furthermore, different classifiers are used for the baseline experiments, and the model is evaluated with various word-embedding methods. Our investigation found that the LSTM-BOOST algorithm outperforms most of the baseline architectures, achieving an F1-score of 92.61% on the Bengali offensive text from Social Platforms (BHSSP) dataset.
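A minimal sketch of the dimensionality-reduction-plus-boosting idea with scikit-learn; the paper boosts LSTM networks with a modified AdaBoost, which is substituted here by stock AdaBoost over SVD-reduced TF-IDF features (TruncatedSVD being the sparse-friendly analogue of PCA), so this is an approximation with hypothetical data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import AdaBoostClassifier
from sklearn.pipeline import make_pipeline

texts = ["offensive example text", "harmless example text"]  # hypothetical
labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=1),   # keep the most significant variance
    AdaBoostClassifier(n_estimators=50),
)
clf.fit(texts, labels)
```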

22 citations


Journal ArticleDOI
TL;DR: A machine learning model is proposed to analyze Arabic tweets from Twitter; it shows a good improvement in average F1 score compared with the baseline classifier and the other classifiers (single-based and ensemble-based) without SMOTE.
Abstract: Sentiment analysis has recently become increasingly important with a massive increase in online content. It is associated with the analysis of textual data generated on social media that can be easily accessed, obtained, and analyzed. With the emergence of COVID-19, most published studies related to COVID-19 conspiracy theories were surveys of people's sentiments and opinions that studied the impact of the pandemic on their lives. Only a few studies applied machine learning-based sentiment analysis to social media. These studies focused mostly on sentiment analysis of English-language tweets and paid little attention to other languages such as Arabic. This study proposes a machine learning model to analyze Arabic tweets from Twitter. In this model, we apply Word2Vec for word embedding, which forms the main source of features. Two pretrained continuous bag-of-words (CBOW) models are investigated, and Naïve Bayes is used as a baseline classifier. Several single-based and ensemble-based machine learning classifiers are used with and without SMOTE (synthetic minority oversampling technique). The experimental results show that applying word embedding with an ensemble and SMOTE achieves a good improvement in average F1 score compared with the baseline classifier and the other classifiers (single-based and ensemble-based) without SMOTE.
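A minimal sketch of balancing embedding features with SMOTE before training an ensemble classifier; the pretrained CBOW models and Arabic preprocessing are omitted, and X stands in for averaged Word2Vec tweet vectors.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 100))        # placeholder embedding features
y = np.array([0] * 90 + [1] * 10)      # imbalanced sentiment labels

# SMOTE synthesizes minority-class samples before ensemble training.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
clf = RandomForestClassifier().fit(X_bal, y_bal)
```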

21 citations


Journal ArticleDOI
TL;DR: Wang et al. as discussed by the authors presented two novel attention-based Bi-LSTM architectures to incorporate emoji and textual information at different semantic levels, and investigated how the emoji information contributes to the performance of personality recognition tasks.

Journal ArticleDOI
TL;DR: In this paper, a hybrid approach using weighted fine-tuned BERT feature extraction with a Siamese Bi-LSTM model is implemented to determine the semantic text similarity of question pairs from the Quora dataset.
Abstract: Conventional semantic text-similarity methods require large amounts of labeled training data and human intervention. They generally neglect contextual information and word order, resulting in data-sparseness and dimensional-explosion issues. Recently, deep-learning methods have been used for determining text similarity. Hence, this study investigates NLP application tasks in detecting the text similarity of question pairs or documents and explores similarity-score predictions. A new hybrid approach using weighted fine-tuned BERT feature extraction with a Siamese Bi-LSTM model is implemented. The technique is employed to determine semantic text similarity for question-pair sets from the Quora dataset. The text features are extracted using the BERT process, followed by word embedding with weights. The features, together with their weight values, are represented as embedding vectors and passed through the layers of a Siamese network. The embedded vectors of the input text features are trained with a deep Siamese Bi-LSTM model across its layers. Finally, similarity scores are determined for each sentence, and the semantic text similarity is learned. The performance of the proposed framework is evaluated with respect to accuracy, precision, recall, and F1 score, compared with other existing text-similarity detection methods. The proposed framework exhibited higher efficiency, with 91% accuracy in determining semantic text similarity, compared with the other existing algorithms.
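A minimal sketch of a Siamese Bi-LSTM similarity scorer in PyTorch: one shared encoder embeds both questions and cosine similarity yields the score. The weighted fine-tuned BERT features are replaced by a plain embedding layer for brevity, so this is an approximation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseBiLSTM(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=128, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True,
                               bidirectional=True)

    def encode(self, ids):
        out, _ = self.encoder(self.emb(ids))
        return out.mean(dim=1)  # mean-pooled sentence vector (shared weights)

    def forward(self, q1_ids, q2_ids):
        return F.cosine_similarity(self.encode(q1_ids), self.encode(q2_ids))

model = SiameseBiLSTM()
score = model(torch.randint(0, 20000, (2, 12)),
              torch.randint(0, 20000, (2, 12)))  # similarity per pair
```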

Journal ArticleDOI
TL;DR: In this article, an abstractive Arabic text summarization system is proposed, based on a sequence-to-sequence model that works through two components, an encoder and a decoder.
Abstract: Text summarization (TS) is considered one of the most difficult tasks in natural language processing (NLP). It remains one of the most important challenges for modern computer systems, despite all their recent improvements. Many papers and research studies in the literature address this task, but most focus on extractive summarization and few on abstractive summarization, especially in the Arabic language, due to its complexity. In this paper, an abstractive Arabic text summarization system is proposed, based on a sequence-to-sequence model. This model works through two components, an encoder and a decoder. Our aim is to develop the sequence-to-sequence model using several deep artificial neural networks to investigate which of them achieves the best performance. Different layers of Gated Recurrent Units (GRU), Long Short-Term Memory (LSTM), and Bidirectional Long Short-Term Memory (BiLSTM) have been used to develop the encoder and the decoder. In addition, the global attention mechanism has been used because it provides better results than the local attention mechanism. Furthermore, AraBERT preprocessing has been applied in the data-preprocessing stage, which helps the model understand Arabic words and achieve state-of-the-art results. Moreover, a comparison between the skip-gram and continuous bag of words (CBOW) Word2Vec word embedding models has been made. We built these models using the Keras library and ran them on Google Colab Jupyter notebooks. Finally, the proposed system is evaluated through the ROUGE-1, ROUGE-2, ROUGE-L, and BLEU evaluation metrics. The experimental results show that three layers of BiLSTM hidden states at the encoder achieve the best performance. In addition, our proposed system outperforms other recent research studies. Also, the results show that abstractive summarization models that use the skip-gram Word2Vec model outperform those that use the CBOW Word2Vec model.
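A minimal sketch of the skip-gram vs. CBOW comparison from the embedding stage: gensim trains both variants on the same corpus by toggling `sg`. The seq2seq summarizer itself is not shown, and the corpus is a hypothetical placeholder.

```python
from gensim.models import Word2Vec

corpus = [["summary", "generation", "arabic"],
          ["encoder", "decoder", "attention"]]

skipgram = Word2Vec(corpus, vector_size=100, sg=1, min_count=1)  # skip-gram
cbow = Word2Vec(corpus, vector_size=100, sg=0, min_count=1)      # CBOW

# Either model's vectors can initialize the encoder's embedding layer; the
# paper reports the skip-gram variant scoring better downstream.
print(skipgram.wv["encoder"].shape, cbow.wv["encoder"].shape)
```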

Journal ArticleDOI
TL;DR: The combination of ML and NLP is implemented to classify fake news based on an open, large, labeled Twitter corpus; the neural network models are found to outperform the traditional ML models by, on average, approximately 6% in precision, with all neural network models exceeding 90% accuracy.
Abstract: Due to the openness and easy accessibility of online social media (OSM), anyone can easily contribute a simple paragraph of text to express their opinion on an article that they have seen. Without access control mechanisms, it has been reported that many suspicious messages and accounts spread across multiple platforms. Accordingly, identifying and labeling fake news is a demanding problem due to the massive amount of heterogeneous content. In essence, the functions of machine learning (ML) and natural language processing (NLP) are to enhance, speed up, and automate the analytical process, so this unstructured text can be transformed into meaningful data and insights. In this paper, the combination of ML and NLP is implemented to classify fake news based on an open, large, labeled Twitter corpus. In this case, we compare several state-of-the-art ML and neural network models based on content-only features. To enhance classification performance, term frequency-inverse document frequency (TF-IDF) features were applied in ML training, while word embedding was utilized in neural network training. By implementing ML and NLP methods, all the traditional models achieve greater than 85% accuracy, and all the neural network models achieve greater than 90% accuracy. From the experiments, we found that the neural network models outperform the traditional ML models by, on average, approximately 6% in precision.
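A minimal sketch of the TF-IDF feature path used for the traditional ML models; the neural path would swap TF-IDF for word embeddings. Texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = ["breaking news about the election", "celebrity cures disease claim"]
labels = [0, 1]  # 0 = real, 1 = fake (hypothetical)

fake_news_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression())
fake_news_clf.fit(tweets, labels)
print(fake_news_clf.predict(["new claim spreads online"]))
```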

Journal ArticleDOI
TL;DR: A comparative analysis of multiple machine learning and deep learning models to identify suicidal thoughts on the social media platform Twitter reveals that the RF model achieves the highest classification score among the machine learning algorithms, while training deep learning classifiers with word embeddings improves performance over the ML models.
Abstract: Social networks are essential resources for obtaining information about people's opinions and feelings towards various issues, as they share their views with their friends and family. Suicidal ideation detection via online social network analysis has emerged as an essential research topic with significant difficulties in the fields of NLP and psychology in recent years. With proper exploitation of the information in social media, the complicated early symptoms of suicidal ideation can be discovered and, hence, many lives can be saved. This study offers a comparative analysis of multiple machine learning and deep learning models to identify suicidal thoughts on the social media platform Twitter. The principal purpose of our research is to achieve better model performance than prior research works to recognize early indications with high accuracy and prevent suicide attempts. We applied text pre-processing and feature extraction approaches, such as CountVectorizer and word embedding, and trained several machine learning and deep learning models for this goal. Experiments were conducted on a dataset of 49,178 instances retrieved from live tweets by 18 suicidal and non-suicidal keywords using the Python Tweepy API. Our experimental findings reveal that the RF model achieves the highest classification score among the machine learning algorithms, with an accuracy of 93% and an F1 score of 0.92. However, training deep learning classifiers with word embeddings improves performance over the ML models, with the BiLSTM model reaching an accuracy of 93.6% and an F1 score of 0.93.

Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper developed a text-mining method for chemical accident cases based on word embedding and deep learning, and the developed text-classification model could classify different types of accidents as fires, explosions, poisoning, and others.
Abstract: Accident precursors can provide valuable clues for risk assessment and risk warning. Trends such as the main characteristics, common causes, and high-frequency types of chemical accidents can provide references for formulating safety-management strategies. However, such information is usually documented in unstructured or semistructured free text related to chemical accident cases, and it can be costly to manually extract the information. Recently, text-mining methods based on deep learning have been shown to be very effective. This study, therefore, developed a text-mining method for chemical accident cases based on word embedding and deep learning. First, the word2vec model was used to obtain word vectors from a text corpus of chemical accident cases. Then, a bidirectional long short-term memory (LSTM) model with an attention mechanism was constructed to classify the types and causes of Chinese chemical accident cases. The case studies revealed the following results: 1) common trends in chemical accidents (e.g., characteristics, causes, high-frequency types) could be obtained through correlation analysis based on word embedding; 2) the developed text-classification model could classify accidents into types such as fire, explosion, poisoning, and others, and the model's average precision (73.1%) and recall (72.5%) achieved ideal performance for Chinese text classification; 3) the developed text-classification model could classify the causes of accidents as personal unsafe acts, personal habitual behavior, unsafe conditions of equipment or materials, and vulnerabilities of the management strategy; precision and recall were 63.6% for the management-strategy causes, and the average precision and recall were both 60.7%; 4) the accident precursors of explosion, fire, and poisoning were obtained through correlation analyses of each high-frequency type of chemical accident case based on text classification; 5) the text-mining method provides site managers with an efficient tool for extracting useful insights from chemical accident cases based on word embedding and deep learning.
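A minimal sketch of the correlation-analysis step: after training word2vec on an accident-case corpus, the nearest neighbors of an accident-type word surface candidate precursors. The corpus is a hypothetical English stand-in for the Chinese case texts.

```python
from gensim.models import Word2Vec

cases = [["valve", "leak", "ignition", "explosion"],
         ["tank", "overpressure", "explosion"],
         ["solvent", "vapor", "fire"]]
model = Word2Vec(cases, vector_size=50, window=3, min_count=1, epochs=100)

# Terms closest to "explosion" in the embedding space act as precursor
# candidates for that accident type.
print(model.wv.most_similar("explosion", topn=3))
```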

Journal ArticleDOI
TL;DR: The authors proposed a dynamic embedding projection-gated convolutional neural network (DEP-CNN) for multi-class and multi-label text classification, which transforms and carries word information by using gating units and shortcut connections to control how much context information is incorporated into each specific position of a word embedding matrix in a text.
Abstract: Text classification is a fundamental and important area of natural language processing concerned with assigning a text to at least one predefined tag or category according to its content. Most advanced systems are either too simple to achieve high accuracy or centered on using complex structures to capture the genuinely required category information, which requires a long time to converge during the training stage. To address these challenges, we propose a dynamic embedding projection-gated convolutional neural network (DEP-CNN) for multi-class and multi-label text classification. Its dynamic embedding projection gate (DEPG) transforms and carries word information by using gating units and shortcut connections to control how much context information is incorporated into each specific position of a word-embedding matrix in a text. To our knowledge, we are the first to apply a DEPG over a word-embedding matrix. The experimental results on four known benchmark datasets show that DEP-CNN outperforms its recent peers.
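A minimal sketch of one plausible reading of the DEPG: a position-wise gate mixing a projected embedding with a shortcut to the original embedding. The exact gating form is an assumption, not the authors' code.

```python
import torch
import torch.nn as nn

class EmbeddingProjectionGate(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.proj = nn.Linear(emb_dim, emb_dim)
        self.gate = nn.Linear(emb_dim, emb_dim)

    def forward(self, x):                 # x: (batch, seq_len, emb_dim)
        g = torch.sigmoid(self.gate(x))   # per-position gate values in [0, 1]
        # Gated mix of transformed context and the shortcut connection.
        return g * torch.tanh(self.proj(x)) + (1 - g) * x

x = torch.randn(2, 16, 128)
print(EmbeddingProjectionGate(128)(x).shape)  # torch.Size([2, 16, 128])
```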

Journal ArticleDOI
TL;DR: In this article, a hybrid embedding-based text representation for hierarchical multi-label text classification (HMTC) was proposed, which consists of both graph embeddings of categories in the hierarchy and their word embedding of category labels.
Abstract: Many real-world text classification tasks often deal with a large number of closely related categories organized in a hierarchical structure or taxonomy. Hierarchical multi-label text classification (HMTC) has become rather challenging when it requires handling large sets of closely related categories. The structural features of all categories in the entire hierarchy and the word semantics of their category labels are very helpful in improving text classification accuracy over large sets of closely related categories, which has been neglected in most of existing HMTC approaches. In this paper, we present a hybrid embedding-based text representation for HMTC with high accuracy. First, the hybrid embedding consists of both graph embedding of categories in the hierarchy and their word embedding of category labels. The Structural Deep Network Embedding-based graph embedding model is used to simultaneously encode the global and local structural features of a given category in the whole hierarchy for making the category structurally discriminable. We further use the word embedding technique to encode the word semantics of each category label in the hierarchy for making different categories semantically discriminable. Second, we presented a level-by-level HMTC approach based on the bidirectional Gated Recurrent Unit network model together with the hybrid embedding that is used to learn the representation of the text level-by-level. Last but not least, extensive experiments were made over five large-scale real-world datasets in comparison with the state-of-the-art hierarchical and flat multi-label text classification approaches, and the experimental results show that our approach is very competitive to the state-of-the-art approaches in classification accuracy, in particular maintaining computational costs while achieving superior performance.

Journal ArticleDOI
TL;DR: DepecheMood++, as mentioned in this paper, is an extension of an existing and widely used emotion lexicon for English, together with a novel version of the lexicon targeting Italian, which can be used to boost performance on datasets and tasks of varying degrees of domain-specificity.
Abstract: Several lexica for sentiment analysis have been developed; while most of these come with word polarity annotations (e.g., positive/negative), attempts at building lexica for finer-grained emotion analysis (e.g., happiness, sadness) have recently attracted significant attention. They are often exploited as building blocks for developing emotion recognition learning models and/or used as baselines against which the performance of those models can be compared. In this work, we contribute two new resources, which we call DepecheMood++ (DM++): a) an extension of an existing and widely used emotion lexicon for English; and b) a novel version of the lexicon, targeting Italian. Furthermore, we show how simple techniques can be used, in both supervised and unsupervised experimental settings, to boost performance on datasets and tasks of varying degrees of domain-specificity. Also, we report an extensive comparative analysis against other available emotion lexica and state-of-the-art supervised approaches, showing that DepecheMood++ emerges as the best-performing non-domain-specific lexicon in unsupervised settings. We also observe that simple learning models on top of DM++ can provide more challenging baselines. We finally introduce embedding-based methodologies to perform a) vocabulary expansion to address data scarcity and b) vocabulary porting to new languages when training data is not available.
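A minimal sketch of embedding-based vocabulary expansion: an out-of-lexicon word inherits the emotion scores of its nearest in-lexicon neighbor in an embedding space. The lexicon entries and vectors are hypothetical placeholders, not DM++ data.

```python
import numpy as np

lexicon = {"joyful": {"happiness": 0.9}, "mournful": {"sadness": 0.8}}
vectors = {"joyful": np.array([1.0, 0.1]),
           "mournful": np.array([-1.0, 0.2]),
           "cheerful": np.array([0.9, 0.15])}  # not in the lexicon

def expand(word, k=1):
    # Rank in-lexicon words by cosine similarity and copy their scores.
    sims = sorted(
        ((float(np.dot(vectors[word], vectors[w]) /
                (np.linalg.norm(vectors[word]) * np.linalg.norm(vectors[w]))), w)
         for w in lexicon),
        reverse=True)[:k]
    return {w: lexicon[w] for _, w in sims}

print(expand("cheerful"))  # inherits the scores of "joyful"
```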

Journal ArticleDOI
TL;DR: In this article, a FastText-based word embedding strategy is employed to represent each peptide sample via a skip-gram model, and a deep neural network (DNN) model is applied to accurately discriminate the ACPs.

Journal ArticleDOI
01 Jan 2022-Array
TL;DR: In this article, the authors proposed DL models for sentiment analysis of Bangla text using an extended lexicon data dictionary (LDD) and implemented the rule-based Bangla Text Sentiment Score (BTSC) algorithm for extracting polarity from large texts.

Journal ArticleDOI
TL;DR: The authors used a deep learning method with several architectures such as CNN, Bidirectional LSTM, and ResNet, combined with pre-trained word embedding, trained using four different datasets.

Journal ArticleDOI
TL;DR: Zhang et al. as mentioned in this paper proposed a topic-aware extractive and abstractive summarization model named T-BERTSum, based on Bidirectional Encoder Representations from Transformers (BERTs), which can simultaneously infer topics and generate summarization from social texts.
Abstract: In the era of social networks, the rapid growth of data mining in information retrieval and natural language processing makes automatic text summarization necessary. Currently, pretrained word embeddings and sequence-to-sequence models can be effectively adapted in social network summarization to extract significant information with strong encoding capability. However, how to tackle long-text dependence and utilize the latent topic mapping has become an increasingly crucial challenge for these models. In this article, we propose a topic-aware extractive and abstractive summarization model named T-BERTSum, based on Bidirectional Encoder Representations from Transformers (BERT). Improving on previous models, the proposed approach can simultaneously infer topics and generate summarization from social texts. First, the encoded latent topic representation, obtained through the neural topic model (NTM), is matched with the embedded representation of BERT to guide the generation with the topic. Second, long-term dependencies are learned through the Transformer network to jointly explore topic inference and text summarization in an end-to-end manner. Third, long short-term memory (LSTM) network layers are stacked on the extractive model to capture sequence timing information, and the effective information is further filtered on the abstractive model through a gated network. In addition, a two-stage extractive-abstractive model is constructed to share the information. Compared with previous work, the proposed model T-BERTSum focuses on pretrained external knowledge and topic mining to capture more accurate contextual representations. Experimental results on the CNN/Daily Mail and XSum datasets demonstrate that our proposed model achieves new state-of-the-art results while generating consistent topics compared with the most advanced methods.

Journal ArticleDOI
TL;DR: In this article, the authors propose a transformer-based sentiment analysis approach for cross-domain sentiment analysis that considers temporal relationships between consecutive snapshots of informative market data and mood time series for market price prediction.
Abstract: A real-time market prediction tool that tracks public opinion in specialized newsgroups and informative market data can guide investors in financial markets. Previous works mainly used lexicon-based sentiment analysis for financial market prediction, while recently proposed transformer-based sentiment analysis promises good results for cross-domain sentiment analysis. This work considers temporal relationships between consecutive snapshots of informative market data and mood time series for market price prediction. We calculate the sentiment mood time series via the probability distribution of news embeddings generated by a BERT-based transformer language model fine-tuned for financial-domain sentiment analysis. We then use a deep recurrent neural network for feature extraction, followed by a dense layer for price regression. We implemented our approach as an open-source API for real-time price regression. We built a corpus of financial news related to currency pairs in the foreign exchange and cryptocurrency markets. We further augment our model with informative technical indicators and news sentiment scores aligned on the news release timestamps. The results of our experiments show significant error reduction compared to the baselines. Our Financial News and Financial Sentiment Analysis RESTful APIs are available for public use.
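A minimal sketch of building a daily mood time series from news headlines with a transformers sentiment pipeline; the fine-tuned financial-domain model, technical indicators, and the recurrent price regressor are omitted, and the generic default model stands in for the paper's BERT variant.

```python
from collections import defaultdict
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # generic model, not finance-tuned

news = [("2022-01-03", "Bitcoin rallies on ETF optimism"),
        ("2022-01-03", "Regulators warn on crypto risk"),
        ("2022-01-04", "Exchange outage shakes traders")]

mood = defaultdict(list)
for day, headline in news:
    result = sentiment(headline)[0]
    signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
    mood[day].append(signed)

# Per-day averages form the mood time series fed to the price regressor.
daily_mood = {d: sum(v) / len(v) for d, v in mood.items()}
```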

Journal ArticleDOI
TL;DR: The issue of polarization of Twitter sentiments, one of the major areas of concern in sentiment analysis, is addressed; the proposed approach classifies sentiments using emotions from Plutchik's wheel of emotion, which provides eight basic emotions to make the task more approachable.

Journal ArticleDOI
TL;DR: In this article, the authors used an NLP-based method to create embeddings of emotional lexicons by applying attention-based deep clustering; the learned representation is then used to visualize the emotional aspects of text authored by patients.

Journal ArticleDOI
Min Zhang
TL;DR: In this paper, a dataset of COVID-19 Twitter posts from nine states in the United States covering fifteen days (1 April 2020 to 15 April 2020) was used to analyze user sentiment.
Abstract: The novel coronavirus disease (COVID-19) has dramatically affected people's daily lives worldwide. More specifically, since there is still insufficient access to vaccines and no straightforward, reliable treatment for COVID-19, every country has taken appropriate precautions (such as physical separation, masking, and lockdown) to combat this extremely infectious disease. As a result, people spend much time on online social networking platforms (e.g., Facebook, Reddit, LinkedIn, and Twitter) and express their feelings and thoughts regarding COVID-19. Twitter is a popular social networking platform that enables anyone to share short posts known as tweets. This research used Twitter datasets to explore user sentiment from the COVID-19 perspective. We used a dataset of COVID-19 Twitter posts from nine states in the United States covering fifteen days (1 April 2020 to 15 April 2020) to analyze user sentiment. We focus on exploiting machine learning (ML) and deep learning (DL) approaches to classify user sentiments regarding COVID-19. First, we labeled the dataset into three groups based on the sentiment values, namely positive, negative, and neutral, to train some popular ML algorithms and DL models to predict the user-concern label on COVID-19. Additionally, we compared the traditional bag-of-words and term frequency-inverse document frequency (TF-IDF) representations for converting text to numeric vectors in the ML techniques. Furthermore, we contrasted the encoding methodology and various word embedding schemes, such as word-to-vector (Word2Vec) and global vectors for word representation (GloVe), with three sets of dimensions (100, 200, and 300), for representing text as numeric vectors in the DL approaches. Finally, we compared COVID-19 infection cases and COVID-19-related tweets during the pandemic.


Journal ArticleDOI
TL;DR: In this paper, the authors developed a semantic thesaurus for construction terms, including 208 word-replacement rules based on Word2Vec embeddings, to understand the different vocabularies.