Topic

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications have been published on this topic, receiving 153,378 citations. The topic is also known as: word embeddings.


Papers
Journal ArticleDOI
TL;DR: In this article, the authors present a novel technique that incorporates a BERT-based multilingual model in bioinformatics to represent the information in DNA sequences, treating DNA sequences as natural sentences and using BERT models to transform them into fixed-length numerical matrices.
Abstract: Recently, language representation models have drawn a lot of attention in the natural language processing field due to their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieved novel state-of-the-art performance. BERT adopts contextualized word embeddings to capture the semantics of words in the contexts in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model in bioinformatics to represent the information in DNA sequences. We treat DNA sequences as natural sentences and then use BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. We observed that our BERT-based features improved sensitivity, specificity, accuracy, and Matthews correlation coefficient by more than 5-10% compared to the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (represented here by 2D convolutional neural networks, CNNs) holds potential to learn BERT features better than other traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.

69 citations
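As a rough illustration of the core idea, the sketch below treats a DNA sequence as a "sentence" of overlapping k-mers and feeds it through a multilingual BERT model to obtain a fixed-length embedding matrix. It assumes the HuggingFace transformers library; the k-mer size, maximum length, and tokenization scheme are illustrative choices, not necessarily the paper's exact recipe.

```python
# Minimal sketch: embed a DNA sequence with a multilingual BERT model.
# The k-mer tokenization below is an illustrative assumption.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def dna_to_matrix(sequence: str, k: int = 3, max_len: int = 128) -> torch.Tensor:
    """Treat a DNA sequence as a 'sentence' of overlapping k-mers and
    return a fixed-length (max_len x hidden_size) embedding matrix."""
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    inputs = tokenizer(" ".join(kmers), return_tensors="pt",
                       padding="max_length", truncation=True, max_length=max_len)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # shape: (max_len, 768)

matrix = dna_to_matrix("ACGTACGTGGCA")
print(matrix.shape)  # torch.Size([128, 768])
```

The resulting fixed-size matrix can then serve as input to a downstream 2D CNN classifier, in line with the paper's conclusion.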

Posted Content
Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, Tie-Yan Liu
TL;DR: This paper develops a neat, simple yet effective way to learn FRequency-AGnostic word Embedding (FRAGE) using adversarial training and shows that with FRAGE, the model achieves higher performance than the baselines in all tasks.
Abstract: Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In this paper, we develop a neat, simple yet effective way to learn FRequency-AGnostic word Embedding (FRAGE) using adversarial training. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that with FRAGE, we achieve higher performance than the baselines in all tasks.

69 citations
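The adversarial idea can be sketched as follows: a small discriminator tries to predict a word's frequency class from its embedding, while the embeddings are trained to fool it, on top of the usual task loss. The PyTorch sketch below is a minimal illustration; the frequency split, network sizes, and training schedule are assumptions, not the paper's configuration.

```python
# Minimal sketch of frequency-adversarial embedding training (FRAGE-style).
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 128
embedding = nn.Embedding(vocab_size, dim)
# Discriminator tries to tell high-frequency words from rare ones.
discriminator = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

# Assumption: word ids are sorted by frequency; top 20% are labeled "frequent".
freq_label = torch.zeros(vocab_size)
freq_label[: vocab_size // 5] = 1.0

opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(embedding.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    ids = torch.randint(0, vocab_size, (256,))
    emb = embedding(ids)

    # 1) Train the discriminator to predict the frequency class.
    d_loss = bce(discriminator(emb.detach()).squeeze(1), freq_label[ids])
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the embeddings to FOOL the discriminator (flipped labels).
    #    In the real setup this is added to the task loss,
    #    e.g. loss = task_loss + lambda * adv_loss.
    adv_loss = bce(discriminator(emb).squeeze(1), 1.0 - freq_label[ids])
    opt_e.zero_grad()
    adv_loss.backward()
    opt_e.step()
```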

Journal ArticleDOI
Xiangjie Kong, Mengyi Mao, Wei Wang, Jiaying Liu, Bo Xu
TL;DR: Through the APS data set, it is shown that VOPRec outperforms state-of-the-art paper recommendation baselines measured by precision, recall, F1, and NDCG.
Abstract: Finding relevant papers is a non-trivial problem for scholars due to the tremendous amount of academic information in the era of scholarly big data. Scientific paper recommendation systems have been developed to solve this problem by recommending relevant papers to scholars. However, previous paper recommendation systems calculate paper similarity based on hand-engineered features, which are inflexible. To address this problem, we develop a scientific paper recommendation system, namely VOPRec, by vector representation learning of papers in citation networks. VOPRec takes advantage of recent research in both text and network representation learning for unsupervised feature design. In VOPRec, the text information is represented with word embeddings to find papers of similar research interest. Then, the structural identity is converted into vectors to find papers of similar network topology. After bridging text information and structural identity with the citation network, the vector representation of papers can be learned with network embedding. Finally, a top-Q recommendation list is generated based on the similarity calculated with the paper vectors. Through the APS data set, we show that VOPRec outperforms state-of-the-art paper recommendation baselines measured by precision, recall, F1, and NDCG.

69 citations
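The final step, generating the top-Q list from learned paper vectors, reduces to a nearest-neighbor search under cosine similarity. Below is a minimal NumPy sketch that assumes the fused paper vectors (text, structural identity, and network embedding) have already been learned.

```python
# Minimal sketch of the top-Q recommendation step over learned paper vectors.
import numpy as np

def recommend(paper_vecs: np.ndarray, query_idx: int, Q: int = 10) -> np.ndarray:
    """Return indices of the top-Q papers most similar to paper `query_idx`."""
    norms = np.linalg.norm(paper_vecs, axis=1, keepdims=True)
    unit = paper_vecs / np.clip(norms, 1e-12, None)
    scores = unit @ unit[query_idx]   # cosine similarity to the query paper
    scores[query_idx] = -np.inf       # exclude the query paper itself
    return np.argsort(-scores)[:Q]

vecs = np.random.rand(1000, 64)       # stand-in for learned paper vectors
print(recommend(vecs, query_idx=0, Q=5))
```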

Journal ArticleDOI
TL;DR: This paper develops a word embedding-based text summarizer, shows that the Word2Vec representation gives better results than the traditional BOW representation, and proposes three ensemble techniques that improve the quality of ATS.
Abstract: The vast amounts of data being collected and analyzed have become an invaluable source of information, which needs to be easily handled by humans. Automatic Text Summarization (ATS) systems enable users to get the gist of information and knowledge in a short time in order to make critical decisions quickly. Deep neural networks have proven their ability to achieve excellent performance in many real-world Natural Language Processing and computer vision applications; however, they have received little attention in ATS. The key problem of traditional applications is that they involve high-dimensional and sparse data, which makes it difficult to capture relevant information. One technique for overcoming these problems is learning features via dimensionality reduction. Word embedding, on the other hand, is a neural network technique that generates a much more compact word representation than the traditional Bag-of-Words (BOW) approach. In this paper, we seek to enhance the quality of ATS by integrating unsupervised deep neural network techniques with the word embedding approach. First, we develop a word embedding-based text summarizer and show that the Word2Vec representation gives better results than the traditional BOW representation. Second, we propose further models that combine Word2Vec and unsupervised feature learning methods in order to merge information from different sources. We show that unsupervised neural network models trained on the Word2Vec representation give better results than those trained on the BOW representation. Third, we propose three ensemble techniques: the first combines BOW and Word2Vec using a majority voting technique; the second aggregates the information provided by the BOW approach and unsupervised neural networks; and the third aggregates the information provided by Word2Vec and unsupervised neural networks. We show that the ensemble methods improve the quality of ATS, and in particular the Word2Vec-based ensemble gives better results. Finally, we perform different experiments to evaluate the performance of the investigated models, using two kinds of publicly available datasets for the ATS task. Statistical studies affirm that word embedding-based models outperform BOW-based models on the summarization task. In particular, the ensemble learning technique with the Word2Vec representation surpasses all the investigated models.

68 citations
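A minimal sketch of the Word2Vec-based extractive summarizer and a two-way voting ensemble with a BOW scorer is shown below. It assumes gensim for Word2Vec; the centroid-similarity scoring and the trivial BOW scorer are illustrative stand-ins, not the paper's exact models.

```python
# Minimal sketch: extractive summarization on Word2Vec features, plus a
# simple voting ensemble with a BOW scorer. Both scorers are illustrative.
import numpy as np
from gensim.models import Word2Vec

def summarize(sentences, n=2):
    tokens = [s.lower().split() for s in sentences]
    # Train a small Word2Vec on the document itself as a stand-in for a
    # model trained on a large corpus.
    w2v = Word2Vec(tokens, vector_size=50, min_count=1, epochs=50)
    sent_vecs = np.array([np.mean([w2v.wv[w] for w in t], axis=0) for t in tokens])

    # Word2Vec scorer: cosine similarity to the document centroid.
    centroid = sent_vecs.mean(axis=0)
    sims = sent_vecs @ centroid / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(centroid) + 1e-12)

    # Trivial BOW scorer: fraction of the document vocabulary a sentence covers.
    vocab = set(w for t in tokens for w in t)
    bow = np.array([len(set(t)) / len(vocab) for t in tokens])

    # Two-way vote: keep sentences both scorers place in their top-n.
    top_w2v, top_bow = set(np.argsort(-sims)[:n]), set(np.argsort(-bow)[:n])
    keep = sorted(top_w2v & top_bow) or sorted(top_w2v)
    return [sentences[i] for i in keep]

doc = ["Deep models need large amounts of data.",
       "Word embeddings give compact representations of sparse text.",
       "Compact embedding representations capture word semantics.",
       "The cat sat on the mat."]
print(summarize(doc, n=2))
```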

Posted Content
TL;DR: A joint model for performing unsupervised morphological analysis on words and learning a character-level composition function from morphemes to word embeddings, which is comparable to dedicated morphological analyzers at the task of morpheme boundary recovery and performs better than word-based embedding models at the task of syntactic analogy answering.
Abstract: This paper presents a joint model for performing unsupervised morphological analysis on words, and learning a character-level composition function from morphemes to word embeddings. Our model splits individual words into segments, and weights each segment according to its ability to predict context words. Our morphological analysis is comparable to dedicated morphological analyzers at the task of morpheme boundary recovery, and also performs better than word-based embedding models at the task of syntactic analogy answering. Finally, we show that incorporating morphology explicitly into character-level models helps them produce embeddings for unseen words that correlate better with human judgments.

68 citations
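The composition function can be sketched as a weighted sum of morpheme (segment) embeddings, with a learned score deciding each segment's contribution. In the paper the weights come from each segment's ability to predict context words; the softmax attention below, and the segmentation and ids in the usage example, are simplified, hypothetical stand-ins.

```python
# Minimal sketch: compose a word embedding from weighted morpheme segments.
import torch
import torch.nn as nn

class MorphemeComposer(nn.Module):
    def __init__(self, n_morphemes: int, dim: int):
        super().__init__()
        self.morph_emb = nn.Embedding(n_morphemes, dim)
        self.scorer = nn.Linear(dim, 1)  # scores each segment's contribution

    def forward(self, morpheme_ids: torch.Tensor) -> torch.Tensor:
        segs = self.morph_emb(morpheme_ids)                # (n_segs, dim)
        weights = torch.softmax(self.scorer(segs), dim=0)  # per-segment weight
        return (weights * segs).sum(dim=0)                 # composed word vector

# e.g. "unhappiness" segmented as ["un", "happy", "ness"] -> ids [3, 17, 42]
composer = MorphemeComposer(n_morphemes=100, dim=64)
word_vec = composer(torch.tensor([3, 17, 42]))
print(word_vec.shape)  # torch.Size([64])
```

Because the vector is built from segments rather than a fixed vocabulary entry, the same composer can produce embeddings for unseen words, which is the property the abstract highlights.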


Network Information
Related Topics (5)
- Recurrent neural network: 29.2K papers, 890K citations (87% related)
- Unsupervised learning: 22.7K papers, 1M citations (86% related)
- Deep learning: 79.8K papers, 2.1M citations (85% related)
- Reinforcement learning: 46K papers, 1M citations (84% related)
- Graph (abstract data type): 69.9K papers, 1.2M citations (84% related)
Performance Metrics
No. of papers in the topic in previous years:

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788