Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published within this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Journal ArticleDOI
TL;DR: In this article, the authors proposed a new dataset for threatening language detection in Urdu tweets, which contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening.
Abstract: Automatic detection of threatening language is an important task; however, most existing studies focus on English as the target language, with limited work on low-resource languages. In this paper, we introduce and release a new dataset for threatening language detection in Urdu tweets to further research in this language. The proposed dataset contains 3,564 tweets manually annotated by human experts as either threatening or non-threatening. The threatening tweets are further classified by target into one of two types: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet targets an individual or a group. We compare three forms of text representation: two count-based, where the text is represented using either character n-gram counts or word n-gram counts as feature vectors, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that an MLP classifier with the combination of word n-gram features outperformed other classifiers in detecting threatening tweets. Further, an SVM classifier using fastText pre-trained word embeddings obtained the best results for the target identification task.
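The two count-based representations compared in this paper can be sketched in a few lines. This is an illustrative toy version using only the standard library; the paper's actual preprocessing, vocabulary handling, and feature selection may differ.

```python
# Toy sketch of count-based text representations: character n-gram
# counts and word n-gram counts as feature vectors.
from collections import Counter

def char_ngrams(text, n=3):
    """Counts of overlapping character n-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def word_ngrams(text, n=2):
    """Counts of consecutive word n-grams."""
    words = text.split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

tweet = "this is a sample tweet"
print(char_ngrams(tweet, 3).most_common(3))
print(word_ngrams(tweet, 2))
```

In practice these sparse counts would be mapped onto a fixed vocabulary to produce feature vectors of equal length for a classifier such as the MLP or SVM used in the paper.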

26 citations

Proceedings ArticleDOI
01 Dec 2018
TL;DR: A deep learning-based approach to phishing detection is proposed, where websites' URLs and the characters in these URLs are mapped to documents and words, respectively, in the context of word2vec-based word embedding learning, to obtain the vector representations of the URLs.
Abstract: A deep learning-based approach to phishing detection is proposed. Specifically, websites' URLs and the characters in these URLs are mapped to documents and words, respectively, in the context of word2vec-based word embedding learning. Consequently, character embeddings can be obtained from a corpus of URLs in an unsupervised manner. Furthermore, we combine character embeddings with the structure of URLs to obtain vector representations of the URLs. In particular, a URL is partitioned into the following five sections: URL protocol, sub-domain name, domain name, domain suffix, and URL path. To identify phishing URLs, existing classification algorithms can be applied directly to the vector representations of the URLs, avoiding laborious manual and empirical feature engineering. For evaluation, we collect a large-scale dataset, the 1 Million Phishing Detection Dataset (1M-PD), which has been released for public use. Extensive experiments conducted on two real-world datasets show the effectiveness of the proposed approach, which achieves an accuracy of 99.69% with a 0.40% false positive rate and a 99.79% true positive rate on the 1M-PD dataset. In particular, the proposed approach processes each URL in 32 ms on average on an ordinary personal computer, which is much faster than existing approaches and can even be considered real-time.
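The five-way URL partitioning described above can be approximated with the standard library. This is a simplified sketch: correct domain-suffix handling really requires a public-suffix list (e.g. the tldextract package), so the naive "last label is the suffix" split below is an assumption for illustration only.

```python
# Naive sketch of partitioning a URL into the five sections named in
# the paper: protocol, sub-domain, domain, suffix, and path.
from urllib.parse import urlparse

def partition_url(url):
    parsed = urlparse(url)
    host_parts = parsed.netloc.split(".")
    # Naive assumption: last label is the suffix, second-to-last the domain.
    suffix = host_parts[-1]
    domain = host_parts[-2] if len(host_parts) >= 2 else ""
    subdomain = ".".join(host_parts[:-2])
    return {
        "protocol": parsed.scheme,
        "subdomain": subdomain,
        "domain": domain,
        "suffix": suffix,
        "path": parsed.path,
    }

print(partition_url("http://login.example.com/verify/account"))
```

Each section's character embeddings would then be aggregated (per the paper's approach) to build the URL's overall vector representation.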

26 citations

Patent
21 Apr 2016
TL;DR: In this article, a CNN is trained based on a set of training data, which includes learning parameters of class distributed vector representations (DVRs) of each of the predefined set of classes.
Abstract: According to an aspect, a method includes configuring a convolutional neural network (CNN) for classifying text based on word embedding features into a predefined set of classes identified by class labels. The predefined set of classes includes a class labeled none-of-the-above for text that does not fit into any of the other classes in the predefined set of classes. The CNN is trained based on a set of training data. The training includes learning parameters of class distributed vector representations (DVRs) of each of the predefined set of classes. The learning includes minimizing a pair-wise ranking loss function over the set of training data. A class embedding matrix of the class DVRs of the predefined set of classes that excludes a class embedding for the none-of-the-above class is generated. Each column in the class embedding matrix corresponds to one of the predefined classes.
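A pair-wise ranking loss over class representations, in the spirit of the patent's description, can be sketched as follows. The hinge form, the margin value, and the dot-product scoring are assumptions for illustration; the patent's exact loss may differ.

```python
# Minimal sketch of a pair-wise ranking loss over class distributed
# vector representations (DVRs): the correct class should score at
# least `margin` higher than every other class.
import numpy as np

def pairwise_ranking_loss(text_vec, class_matrix, correct, margin=1.0):
    scores = class_matrix @ text_vec          # one score per class
    losses = np.maximum(0.0, margin - scores[correct] + scores)
    losses[correct] = 0.0                     # no loss against itself
    return losses.sum()

rng = np.random.default_rng(0)
classes = rng.normal(size=(4, 8))             # 4 class DVRs, dimension 8
x = classes[2] * 2.0                          # text vector aligned with class 2
print(pairwise_ranking_loss(x, classes, correct=2))
```

During training, gradients of this loss would update both the CNN's text representation and the class DVRs.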

26 citations

Proceedings ArticleDOI
01 Jun 2015
TL;DR: This paper describes the USAAR-WLV taxonomy induction system that participated in the Taxonomy Extraction Evaluation task of SemEval-2015 and extends prior work on using vector space word embedding models for hypernym-hyponym extraction by simplifying the means to extract a projection matrix that transforms any hyponym to itshypernym.
Abstract: This paper describes the USAAR-WLV taxonomy induction system that participated in the Taxonomy Extraction Evaluation task of SemEval-2015. We extend prior work on using vector space word embedding models for hypernym-hyponym extraction by simplifying the means to extract a projection matrix that transforms any hyponym to its hypernym. This is done by making use of function words, which are usually overlooked in vector space approaches to NLP. Our system performs best in the chemical domain and has achieved competitive results in the overall evaluations.
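The core idea of a projection matrix that maps hyponym vectors to hypernym vectors can be illustrated as a least-squares fit over known (hyponym, hypernym) embedding pairs. The random vectors below are toy stand-ins, not real word embeddings, and the paper additionally exploits function words, which this sketch omits.

```python
# Toy sketch: learn a matrix M such that M @ hyponym_vec approximates
# the corresponding hypernym_vec, via least squares.
import numpy as np

rng = np.random.default_rng(42)
M_true = rng.normal(size=(5, 5))              # hidden "true" mapping
hyponyms = rng.normal(size=(20, 5))           # 20 hyponym vectors, dimension 5
hypernyms = hyponyms @ M_true.T               # their hypernym vectors

# Solve min_M ||hyponyms @ M.T - hypernyms||^2.
M_learned, *_ = np.linalg.lstsq(hyponyms, hypernyms, rcond=None)
M_learned = M_learned.T

# Project a new hyponym vector to its predicted hypernym.
new_hypo = rng.normal(size=5)
predicted = M_learned @ new_hypo
print(np.allclose(predicted, M_true @ new_hypo))   # True
```

With real embeddings the fit is only approximate, so the projected vector is matched to its nearest neighbor among candidate hypernyms.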

26 citations

Journal ArticleDOI
TL;DR: This work introduces two algorithms based on the continuous bag-of-words model (CBOW) that learn weights based on the words' relative importance in the classification task, outperforming existing techniques by a relative accuracy improvement of over 9%.
Abstract: Twitter is a popular source for monitoring healthcare information and public disease. However, tweets contain much noise: even when appropriate keywords appear in a tweet, they do not guarantee that it is truly health-related, so traditional keyword-based classification is largely ineffective. Word embedding algorithms have proved useful in many natural language processing (NLP) tasks. We introduce two algorithms based on an existing word embedding learning algorithm, the continuous bag-of-words model (CBOW), and apply them to the task of recognizing healthcare-related tweets. In the CBOW model, the vector representation of words is learned from their contexts; to simplify the computation, the context is represented by an average of all words inside the context window. However, not all words in the context window contribute equally to the prediction of the target word. Greedily incorporating all the words in the context window limits the contribution of the useful semantic words and brings noisy or irrelevant words into the learning process. Some existing word embedding algorithms do learn a weighted CBOW model, but their weights are based on pre-defined syntactic rules and ignore the task the learned embedding will serve. We propose learning weights based on the words' relative importance in the classification task; our intuition is that such learned weights place more emphasis on words that contribute more to the downstream task. We evaluate the embeddings learned from our algorithms on two healthcare-related datasets. The experimental results demonstrate that embeddings learned from the proposed algorithms outperform existing techniques by a relative accuracy improvement of over 9%.
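The difference between uniform CBOW averaging and the weighted averaging proposed above can be shown in a few lines. The weights here are illustrative constants; in the paper they would be learned from the classification task.

```python
# Sketch of context averaging in CBOW: uniform average (standard)
# versus a weighted average that emphasizes important context words.
import numpy as np

def context_vector(word_vecs, weights=None):
    """Weighted average of context word vectors; uniform if no weights."""
    word_vecs = np.asarray(word_vecs, dtype=float)
    if weights is None:
        return word_vecs.mean(axis=0)         # standard CBOW averaging
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize importance weights
    return w @ word_vecs

context = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
print(context_vector(context))                           # uniform average
print(context_vector(context, weights=[0.1, 0.1, 0.8]))  # emphasize 3rd word
```

The weighted version pulls the context representation toward the up-weighted word, which is the intended effect when that word carries most of the task-relevant signal.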

26 citations


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations (87% related)
- Unsupervised learning: 22.7K papers, 1M citations (86% related)
- Deep learning: 79.8K papers, 2.1M citations (85% related)
- Reinforcement learning: 46K papers, 1M citations (84% related)
- Graph (abstract data type): 69.9K papers, 1.2M citations (84% related)
Performance Metrics
No. of papers in the topic in previous years:

Year   Papers
2023   317
2022   716
2021   736
2020   1,025
2019   1,078
2018   788