
Probabilistic correlation-based similarity measure on text records

TLDR
In this article, a probabilistic correlation-based similarity measure is proposed for unstructured text record similarity evaluation; it enriches the information of records by considering correlations between tokens.
Abstract
Large-scale unstructured text records are stored in text attributes in databases and information systems, such as scientific citation records or news highlights. Approximate string matching techniques for full-text retrieval, e.g., edit distance and cosine similarity, can be adopted for unstructured text record similarity evaluation. However, these techniques do not show the best performance when applied directly, owing to the differences between unstructured text records and full text. In particular, the information in short text records is limited, and varied information formats such as abbreviations and missing data greatly affect record similarity evaluation. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching tokens between two records, our similarity evaluation enriches the information of records by considering correlations between tokens. The probabilistic correlation between two tokens is defined as the probability of their appearing together in the same record. We then compute token weights and discover correlations between records based on the probabilistic correlations of tokens. Extensive experimental results demonstrate the effectiveness of the proposed approach.
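The abstract gives only the high-level recipe: token co-occurrence probabilities enrich plain token matching. The following is a minimal Python sketch of that idea under our own assumptions; the aggregation into a record similarity, the function names, and the toy corpus are illustrative, not the authors' published formulas.

```python
from collections import Counter
from itertools import combinations

def token_correlations(records):
    """P(u and v appear in the same record), estimated over a list of token lists."""
    n = len(records)
    pair_counts = Counter()
    for rec in records:
        for u, v in combinations(sorted(set(rec)), 2):
            pair_counts[(u, v)] += 1
    return {pair: count / n for pair, count in pair_counts.items()}

def record_similarity(r1, r2, corr):
    """Illustrative score: exact token overlap plus co-occurrence 'credit'
    for non-matching token pairs that frequently appear together in the corpus."""
    s1, s2 = set(r1), set(r2)
    if not s1 or not s2:
        return 0.0
    overlap = len(s1 & s2)
    cross = sum(corr.get(tuple(sorted((u, v))), 0.0)
                for u in s1 - s2 for v in s2 - s1)
    return min(1.0, (overlap + cross) / max(len(s1), len(s2)))

# Toy corpus of citation-like records; "pvldb" never matches "vldb" exactly,
# but their co-occurrence in the corpus contributes to the similarity score.
corpus = [["pvldb", "vldb", "endowment", "2014"],
          ["vldb", "endowment", "volume", "7"],
          ["pvldb", "volume", "7", "2014"]]
corr = token_correlations(corpus)
print(record_similarity(["vldb", "endowment"], ["pvldb", "2014"], corr))
```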



Citations
Journal ArticleDOI

A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition

TL;DR: A novel knowledge-based similarity/dissimilarity measure between IFSs is proposed, and it is demonstrated that the proposed measure handles the pattern recognition problem more reliably than existing similarity measures.
Journal ArticleDOI

Time Series Data Cleaning: A Survey

TL;DR: In this article, the authors provide a classification of time series data cleaning techniques, comprehensively review the state-of-the-art methods of each type, and highlight possible directions for time series data cleaning.
Journal ArticleDOI

Framework for syntactic string similarity measures

TL;DR: This paper introduces a general framework of syntactic similarity measures for matching short texts by dividing such measures into three components: character-level similarity, string segmentation, and matching technique, and provides an open-source Java toolkit implementing the proposed framework.
Posted Content

Time Series Data Cleaning: A Survey

TL;DR: This survey provides a classification of time series data cleaning techniques, comprehensively reviews the state-of-the-art methods of each type, and highlights possible directions for time series data cleaning.
Journal ArticleDOI

A Survey of Approximate Quantile Computation on Large-Scale Data

TL;DR: This paper focuses on one kind of order statistic, the quantile, and presents a comprehensive analysis of studies on approximate quantile computation, covering both deterministic and randomized algorithms that compute approximate quantiles over streaming or distributed models.
References
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
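For context, latent semantic indexing is usually realized as a truncated SVD of a weighted term-document matrix. The snippet below is a generic sketch using scikit-learn, not code from the cited paper; the toy documents and the choice of two latent dimensions are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; LSI projects tf-idf vectors into a low-rank "semantic" space
# so that documents can match even without exact term overlap.
docs = ["latent semantic indexing of citation records",
        "probabilistic models for document retrieval",
        "indexing citation records improves retrieval"]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# Compare the first document with the other two in the reduced space.
print(cosine_similarity(doc_vectors[:1], doc_vectors[1:]))
```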
Journal ArticleDOI

Probabilistic latent semantic indexing

TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Proceedings Article

An Information-Theoretic Definition of Similarity

Dekang Lin
TL;DR: This work presents an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model, and demonstrates how this definition can be used to measure similarity in a number of different domains.
Posted Content

Using Information Content to Evaluate Semantic Similarity in a Taxonomy

TL;DR: In this article, a new measure of semantic similarity in an IS-A taxonomy based on the notion of information content is presented, and experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, against an upper bound of r = 0.90 for human subjects performing the same task).
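Both of the last two references build on the same quantity, the information content IC(c) = -log p(c) of a concept, with similarity evaluated at the most informative shared ancestor. The sketch below illustrates this on a made-up mini taxonomy (the class names and probabilities are invented for illustration, not taken from the cited experiments): Resnik's measure returns the IC of the shared ancestor, while Lin's information-theoretic variant normalizes it by the two concepts' own IC.

```python
import math

# Hypothetical IS-A taxonomy: child -> parent, with made-up corpus probabilities.
parents = {"dime": "coin", "nickel": "coin", "coin": "cash", "cash": "money",
           "credit": "money", "money": None}
p = {"dime": 0.01, "nickel": 0.01, "coin": 0.05, "cash": 0.2,
     "credit": 0.1, "money": 0.4}

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = parents[c]
    return out

def ic(c):
    """Information content of a concept: -log of its corpus probability."""
    return -math.log(p[c])

def lcs(c1, c2):
    """Most informative (most specific) common subsumer of two concepts."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(common, key=ic)

def resnik(c1, c2):
    return ic(lcs(c1, c2))

def lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))

print(resnik("dime", "credit"), lin("dime", "nickel"))
```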