Proceedings ArticleDOI

Word Embedding based Generalized Language Model for Information Retrieval

TL;DR
A generalized language model is constructed in which the mutual independence between a pair of words (say t and t') no longer holds; instead, the vector embeddings of the words are used to derive transformation probabilities between words.
Abstract
Word2vec, a state-of-the-art word embedding technique, has gained a lot of interest in the NLP community. Word embeddings make it possible to retrieve a list of words that are used in similar contexts with respect to a given word. In this paper, we focus on using word embeddings to enhance retrieval effectiveness. In particular, we construct a generalized language model in which the mutual independence between a pair of words (say t and t') no longer holds. Instead, we make use of the vector embeddings of the words to derive transformation probabilities between words. Specifically, the event of observing a term t of the query from a document d is modeled by two distinct events: generating a different term t', either from the document itself or from the collection, and then transforming it into the observed query term t. The first event, generating an intermediate term from the document, is intended to capture how well a term contextually fits within a document, whereas the second, generating it from the collection, aims to address the vocabulary mismatch problem by taking other related terms in the collection into account. Our experiments, conducted on the standard TREC collection, show that our proposed method yields significant improvements over the LM and LDA-smoothed LM baselines.
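The mixture of events described in the abstract admits a compact implementation. Below is a minimal sketch in Python; the function names, the mixture weights (lam, alpha, beta), the use of normalized cosine similarity as the transformation probability P(t | t'), and the precomputed neighbors map of embedding-space nearest neighbors are illustrative assumptions, not the paper's exact estimators.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def transform_probs(t, candidates, embeddings):
    """P(t | t') for each candidate t', from normalized cosine similarity
    (an assumed estimator; negative similarities are clipped to zero)."""
    sims = {tp: max(cosine(embeddings[t], embeddings[tp]), 0.0)
            for tp in candidates if tp != t and tp in embeddings}
    z = sum(sims.values())
    return {tp: s / z for tp, s in sims.items()} if z > 0 else {}

def glm_log_likelihood(query_terms, doc_terms, coll_tf, coll_len,
                       embeddings, neighbors, lam=0.5, alpha=0.2, beta=0.1):
    """log P(query | d) under a four-way mixture: direct generation from
    the document, generation of an intermediate t' from the document that
    is transformed into t, generation of a related t' from the collection
    that is transformed into t, and plain collection smoothing."""
    tf_d = Counter(doc_terms)
    dlen = len(doc_terms) or 1
    score = 0.0
    for t in query_terms:
        p_direct = tf_d[t] / dlen
        p_coll = coll_tf.get(t, 0) / coll_len
        p_from_doc = p_from_coll = 0.0
        if t in embeddings:
            # Event 1: sample t' from d, then transform t' -> t
            # (captures how well t contextually fits the document).
            trans_d = transform_probs(t, tf_d, embeddings)
            p_from_doc = sum(p * tf_d[tp] / dlen
                             for tp, p in trans_d.items())
            # Event 2: sample a related t' from the collection, then
            # transform it to t (targets the vocabulary mismatch problem).
            trans_c = transform_probs(t, neighbors.get(t, []), embeddings)
            p_from_coll = sum(p * coll_tf.get(tp, 0) / coll_len
                              for tp, p in trans_c.items())
        p = (lam * p_direct + alpha * p_from_doc + beta * p_from_coll
             + (1 - lam - alpha - beta) * p_coll)
        score += math.log(p or 1e-12)
    return score
```

Documents would then be ranked by this log-likelihood for a given query, exactly as with a standard query-likelihood language model.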


Citations
Proceedings ArticleDOI

Learning to Match using Local and Distributed Representations of Text for Web Search

TL;DR: This work proposes a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches them using learned distributed representations; matching with distributed representations complements matching with traditional local representations.
Journal ArticleDOI

BioWordVec, improving biomedical word embeddings with subword information and MeSH.

TL;DR: This work presents BioWordVec: an open set of biomedical word vectors/embeddings that combines subword information from unlabeled biomedical text with a widely-used biomedical controlled vocabulary called Medical Subject Headings (MeSH).
Posted Content

Pretrained Transformers for Text Ranking: BERT and Beyond

TL;DR: This tutorial provides an overview of text ranking with neural network architectures known as transformers, of which BERT (Bidirectional Encoder Representations from Transformers) is the best-known example, and covers a wide range of techniques.
Journal ArticleDOI

A comparison of word embeddings for the biomedical natural language processing

TL;DR: The qualitative evaluation shows that word embeddings trained on EHR and MedLit corpora can find more similar medical terms than the GloVe and Google News embeddings, and the intrinsic quantitative evaluation verifies that the semantic similarity captured by the word embeddings is closer to human experts' judgments on all four tested datasets.
Book

An Introduction to Neural Information Retrieval

TL;DR: The monograph provides a complete picture of neural information retrieval techniques, culminating in supervised neural learning-to-rank models, including deep neural network architectures that are trained end-to-end for ranking tasks.
References
Journal ArticleDOI

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling (a minimal training sketch appears after this reference list).
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents ("semantic structure") to improve the detection of relevant documents on the basis of terms found in queries.
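For the skip-gram model with negative sampling referenced in the entry above, here is a minimal training sketch, assuming the gensim library (version 4.x, whose Word2Vec class exposes the sg and negative parameters); the toy corpus is invented for illustration:

```python
# Train skip-gram embeddings with negative sampling, as in Mikolov et al.
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (illustrative only).
sentences = [
    ["word", "embeddings", "for", "information", "retrieval"],
    ["language", "model", "smoothing", "improves", "retrieval"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimensionality
    window=5,         # context window size
    sg=1,             # 1 = skip-gram (0 = CBOW)
    negative=5,       # negative sampling with 5 noise words
    min_count=1,      # keep every term in this tiny corpus
)

# Retrieve terms used in contexts similar to a given word.
print(model.wv.most_similar("retrieval", topn=3))
```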