scispace - formally typeset

Word embedding

About: Word embedding is a research topic. Over its lifetime, 4,683 publications on this topic have received 153,378 citations. The topic is also known as: word embeddings.


Papers
Journal ArticleDOI
TL;DR: Unsupervised features obtained with a Structured Skip-gram model contribute to the better performance achieved in the FIRE2015 entity extraction task.

23 citations

Journal ArticleDOI
TL;DR: Wang et al. proposed a shuffling strategy that transforms related words and APIs into tuples to address the alignment challenge; using these tuples, Word2API models words and APIs simultaneously.
Abstract: Developers increasingly rely on text matching tools to analyze the relation between natural language words and APIs. However, semantic gaps, namely textual mismatches between words and APIs, negatively affect these tools. Previous studies have transformed words or APIs into low-dimensional vectors for matching; however, inaccurate results were obtained because words and APIs were not modeled simultaneously. To resolve this problem, two main challenges must be addressed: the acquisition of massive numbers of words and APIs for mining, and the alignment of words and APIs for modeling. This study therefore proposes Word2API to effectively estimate the relatedness of words and APIs. Word2API collects millions of commonly used words and APIs from code repositories to address the acquisition challenge. A shuffling strategy then transforms related words and APIs into tuples to address the alignment challenge. Using these tuples, Word2API models words and APIs simultaneously. Word2API outperforms baselines by 10-49.6 percent in relatedness estimation in terms of precision and NDCG. Word2API is also effective in solving typical software tasks, e.g., query expansion and API document linking. A simple system with Word2API-expanded queries recommends up to 21.4 percent more related APIs for developers. Meanwhile, Word2API improves comparison algorithms by 7.9-17.4 percent when linking questions from Question & Answer communities to API documents.
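The core of the alignment idea above is that a method's description words and its API calls are mixed into a single training sequence, so a word2vec-style model sees them in shared contexts. A minimal sketch of that shuffling step, with illustrative token names (the real Word2API pipeline differs in detail):

```python
import random

def make_training_tuple(word_seq, api_seq, seed=0):
    """Mix a method's description words with its API calls into one
    shuffled sequence, so related words and APIs land near each other
    in the training windows of a word2vec-style model.
    Toy sketch of the shuffling strategy; tokens are illustrative."""
    rng = random.Random(seed)
    combined = list(word_seq) + list(api_seq)
    rng.shuffle(combined)  # interleave words and APIs in one tuple
    return combined

# Hypothetical example: description words plus the APIs the method calls.
words = ["read", "file", "lines"]
apis = ["java.io.BufferedReader.readLine", "java.io.FileReader.new"]
training_tuple = make_training_tuple(words, apis)
# Feeding many such mixed tuples to a word-embedding model places words
# and APIs in the same vector space, enabling relatedness estimation.
```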

23 citations

Journal ArticleDOI
TL;DR: This study shows that relationships between geographic and semantic spaces arise when word embedding models are applied to a corpus of documents in Mexican Spanish, and the resulting models achieve high accuracy for geographic named entity recognition in Spanish.
Abstract: In recent years, dense word embeddings for text representation have been widely used since they can model complex semantic and morphological characteristics of language, such as meaning in specific contexts and applications. In contrast to sparse representations, such as one-hot encoding or frequency counts, word embeddings provide computational advantages and improve results in many natural language processing tasks, such as the automatic extraction of geospatial information. Computer systems capable of discovering geographic information from natural language involve a complex process called geoparsing. In this work, we explore the use of word embeddings for two NLP tasks: Geographic Named Entity Recognition and Geographic Entity Disambiguation, as an effort to develop the first Mexican geoparser. Our study shows that relationships between geographic and semantic spaces arise when we apply word embedding models to a corpus of documents in Mexican Spanish. Our models achieved high accuracy for geographic named entity recognition in Spanish.
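Geographic entity disambiguation of the kind described above can be reduced to a nearest-vector decision: among the candidate places sharing a name, pick the one whose embedding is closest to the surrounding context. A toy sketch with made-up two-dimensional vectors (a real geoparser would use trained, high-dimensional embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def disambiguate(context_vec, candidates):
    """Return the candidate place whose embedding is closest to the
    context vector. Vectors here are illustrative toy values."""
    return max(candidates, key=lambda name: cosine(context_vec, candidates[name]))

# "Guadalajara" is ambiguous between Mexico and Spain; a context about
# Jalisco should pull the decision toward the Mexican entity.
candidates = {
    "Guadalajara, Jalisco, MX": [0.9, 0.1],
    "Guadalajara, ES": [0.1, 0.9],
}
disambiguate([0.8, 0.2], candidates)
```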

23 citations

Journal ArticleDOI
TL;DR: This work proposes the use of word embeddings to identify the simplest synonym for a given term, providing a promising approach to simplifying DPLs without terminological resources or parallel corpora.
Abstract: Drug Package Leaflets (DPLs) provide information for patients on how to safely use medicines. Pharmaceutical companies are responsible for producing these documents. However, several studies have shown that patients usually have problems understanding the sections describing posology (dosage quantity and prescription), contraindications and adverse drug reactions. An ultimate goal of this work is to provide an automatic approach that helps these companies write drug package leaflets in easy-to-understand language. Natural language processing has become a powerful tool for improving patient care and advancing medicine because it makes it possible to automatically process the large amount of unstructured information needed for patient care. However, to the best of our knowledge, no research has been done on the automatic simplification of drug package leaflets. In a previous work, we proposed using domain terminological resources to gather a set of synonyms for a given target term. A potential drawback of this approach is that it depends heavily on the existence of dictionaries; these are not always available for every domain and language, and where they do exist their coverage is often scarce. To overcome this limitation, we propose the use of word embeddings to identify the simplest synonym for a given term. Word embedding models represent each word in a corpus with a vector in a semantic space. Our approach is based on the assumption that synonyms should have close vectors because they occur in similar contexts. In our evaluation, we used the EasyDPL (Easy Drug Package Leaflets) corpus, a collection of 306 leaflets written in Spanish and manually annotated with 1,400 adverse drug effects and their simplest synonyms. We focus on leaflets written in Spanish because it is the second most widely spoken language in the world, yet in terms of terminological resources Spanish is usually less well covered than English. Our experiments show an accuracy of 38.5% using word embeddings. This work provides a promising approach to simplifying DPLs without using terminological resources or parallel corpora. Moreover, it could easily be adapted to different domains and languages. However, more research effort is needed, because our embedding-based approach does not yet outperform our previous dictionary-based work.
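The assumption above, that synonyms have close vectors, suggests a simple procedure: take the nearest embedding neighbours of a technical term and pick the one a patient is most likely to know, approximated here by corpus frequency. A hypothetical sketch with toy vectors and counts, not the paper's exact method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def simplest_synonym(term, vectors, freq, k=2):
    """Take the k nearest neighbours of `term` in embedding space and
    return the most frequent one, treating frequency as a proxy for
    simplicity. Vectors and counts below are illustrative toys."""
    neighbours = sorted(
        (w for w in vectors if w != term),
        key=lambda w: cosine(vectors[term], vectors[w]),
        reverse=True,
    )[:k]
    return max(neighbours, key=lambda w: freq.get(w, 0))

vectors = {
    "pyrexia": [1.0, 0.1],
    "fever": [0.95, 0.15],
    "hyperthermia": [0.9, 0.3],
    "rash": [0.1, 1.0],
}
freq = {"fever": 900, "hyperthermia": 40, "rash": 500}
simplest_synonym("pyrexia", vectors, freq)  # the common word "fever"
```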

23 citations

Posted Content
TL;DR: This paper presents the first multi-modal framework for evaluating English word representations based on cognitive lexical semantics, and finds strong correlations in the results between cognitive datasets, across recording modalities and to their performance on extrinsic NLP tasks.
Abstract: An interesting method of evaluating word representations is by how much they reflect the semantic representations in the human brain. However, most, if not all, previous works focus only on small datasets and a single modality. In this paper, we present the first multi-modal framework for evaluating English word representations based on cognitive lexical semantics. Six types of word embeddings are evaluated by fitting them to 15 datasets of eye-tracking, EEG and fMRI signals recorded during language processing. To achieve a global score over all evaluation hypotheses, we apply statistical significance testing that accounts for the multiple comparisons problem. This framework is easily extensible to include other intrinsic and extrinsic evaluation methods. We find strong correlations in the results between cognitive datasets, across recording modalities, and with performance on extrinsic NLP tasks.
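When one global score is aggregated over many evaluation hypotheses, each individual test must be held to a stricter threshold or spurious "significant" fits accumulate. A generic Bonferroni correction illustrates the idea; the paper's actual procedure may differ:

```python
def bonferroni(p_values, alpha=0.05):
    """Flag which hypotheses remain significant after a Bonferroni
    correction: each p-value is compared against alpha / m, where m is
    the number of simultaneous tests. Generic sketch, not the paper's
    exact testing procedure."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Three tests at alpha = 0.05 share a corrected threshold of ~0.0167,
# so only the strongest result survives.
bonferroni([0.001, 0.04, 0.2])
```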

23 citations


Network Information
Related Topics (5)
Recurrent neural network: 29.2K papers, 890K citations, 87% related
Unsupervised learning: 22.7K papers, 1M citations, 86% related
Deep learning: 79.8K papers, 2.1M citations, 85% related
Reinforcement learning: 46K papers, 1M citations, 84% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 84% related
Performance Metrics
No. of papers in the topic in previous years

Year    Papers
2023    317
2022    716
2021    736
2020    1,025
2019    1,078
2018    788