
Probabilistic correlation-based similarity measure on text records

TLDR
In this article, a probabilistic correlation-based similarity measure is proposed for unstructured text record similarity evaluation; it enriches the information of records by considering correlations between tokens.
Abstract
Large-scale unstructured text records are stored in text attributes in databases and information systems, such as scientific citation records or news highlights. Approximate string matching techniques for full-text retrieval, e.g., edit distance and cosine similarity, can be adopted for unstructured text record similarity evaluation. However, these techniques do not show the best performance when applied directly, owing to the differences between unstructured text records and full text. In particular, the information in short text records is limited, and varied information formats such as abbreviations and missing data greatly affect record similarity evaluation. In this paper, we propose a novel probabilistic correlation-based similarity measure. Rather than simply matching tokens between two records, our similarity evaluation enriches the information of records by considering correlations between tokens. The probabilistic correlation between two tokens is defined as the probability of their appearing together in the same record. We then compute token weights and discover correlations between records based on the probabilistic correlations of tokens. Extensive experimental results demonstrate the effectiveness of the proposed approach.
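The abstract gives only the high-level recipe: token co-occurrence probabilities enrich plain token matching. The following is a minimal Python sketch of that idea under our own assumptions; the aggregation into a record similarity, the function names, and the toy corpus are illustrative, not the authors' published formulas.

```python
from collections import Counter
from itertools import combinations

def token_correlations(records):
    """P(u and v appear in the same record), estimated over a list of token lists."""
    n = len(records)
    pair_counts = Counter()
    for rec in records:
        for u, v in combinations(sorted(set(rec)), 2):
            pair_counts[(u, v)] += 1
    return {pair: count / n for pair, count in pair_counts.items()}

def record_similarity(r1, r2, corr):
    """Illustrative score: exact token overlap plus co-occurrence 'credit'
    for non-matching token pairs that frequently appear together in the corpus."""
    s1, s2 = set(r1), set(r2)
    if not s1 or not s2:
        return 0.0
    overlap = len(s1 & s2)
    cross = sum(corr.get(tuple(sorted((u, v))), 0.0)
                for u in s1 - s2 for v in s2 - s1)
    return min(1.0, (overlap + cross) / max(len(s1), len(s2)))

# Toy corpus of citation-like records; "pvldb" never matches "vldb" exactly,
# but their co-occurrence in the corpus contributes to the similarity score.
corpus = [["pvldb", "vldb", "endowment", "2014"],
          ["vldb", "endowment", "volume", "7"],
          ["pvldb", "volume", "7", "2014"]]
corr = token_correlations(corpus)
print(record_similarity(["vldb", "endowment"], ["pvldb", "2014"], corr))
```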



Citations
Journal ArticleDOI

A novel similarity/dissimilarity measure for intuitionistic fuzzy sets and its application in pattern recognition

TL;DR: A novel knowledge-based similarity/dissimilarity measure between IFSs is proposed, and it is demonstrated that the proposed measure handles the pattern recognition problem more reliably than existing similarity measures.
Journal ArticleDOI

Time Series Data Cleaning: A Survey

TL;DR: In this article, the authors provide a classification of time series data cleaning techniques, comprehensively review the state-of-the-art methods of each type, and highlight possible directions for time series data cleaning.
Journal ArticleDOI

Framework for syntactic string similarity measures

TL;DR: This paper introduces a general framework of syntactic similarity measures for matching short texts by dividing such measures into three components: character-level similarity, string segmentation, and matching technique, and provides an open-source Java toolkit implementing the proposed framework.
Posted Content

Time Series Data Cleaning: A Survey

TL;DR: This survey provides a classification of time series data cleaning techniques, comprehensively reviews the state-of-the-art methods of each type, and highlights possible directions for time series data cleaning.
Journal ArticleDOI

A Survey of Approximate Quantile Computation on Large-Scale Data

TL;DR: This paper focuses on one kind of order statistic, the quantile, and presents a comprehensive analysis of studies on approximate quantile computation, covering both deterministic and randomized algorithms that compute approximate quantiles over streaming or distributed models.
References
Journal ArticleDOI

Indexing by Latent Semantic Analysis

TL;DR: A new method for automatic indexing and retrieval that takes advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
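For context, latent semantic indexing is usually realized as a truncated SVD of a weighted term-document matrix. The snippet below is a generic sketch using scikit-learn, not code from the cited paper; the toy documents and the choice of two latent dimensions are arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; LSI projects tf-idf vectors into a low-rank "semantic" space
# so that documents can match even without exact term overlap.
docs = ["latent semantic indexing of citation records",
        "probabilistic models for document retrieval",
        "indexing citation records improves retrieval"]

tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = lsa.fit_transform(tfidf)

# Compare the first document with the other two in the reduced space.
print(cosine_similarity(doc_vectors[:1], doc_vectors[1:]))
```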
Journal ArticleDOI

Probabilistic latent semantic indexing

TL;DR: Probabilistic Latent Semantic Indexing is a novel approach to automated document indexing which is based on a statistical latent class model for factor analysis of count data.
Proceedings Article

An Information-Theoretic Definition of Similarity

Dekang Lin
TL;DR: This work presents an information-theoretic definition of similarity that is applicable as long as there is a probabilistic model, and demonstrates how this definition can be used to measure similarity in a number of different domains.
Posted Content

Using Information Content to Evaluate Semantic Similarity in a Taxonomy

TL;DR: In this article, a new measure of semantic similarity in an IS-A taxonomy based on the notion of information content is presented, and experimental evaluation suggests that the measure performs encouragingly well (a correlation of r = 0.79 with a benchmark set of human similarity judgments, against an upper bound of r = 0.90 for human subjects performing the same task).
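Both of the last two references build on the same quantity, the information content IC(c) = -log p(c) of a concept, with similarity evaluated at the most informative shared ancestor. The sketch below illustrates this on a made-up mini taxonomy (the class names and probabilities are invented for illustration, not taken from the cited experiments): Resnik's measure returns the IC of the shared ancestor, while Lin's information-theoretic variant normalizes it by the two concepts' own IC.

```python
import math

# Hypothetical IS-A taxonomy: child -> parent, with made-up corpus probabilities.
parents = {"dime": "coin", "nickel": "coin", "coin": "cash", "cash": "money",
           "credit": "money", "money": None}
p = {"dime": 0.01, "nickel": 0.01, "coin": 0.05, "cash": 0.2,
     "credit": 0.1, "money": 0.4}

def ancestors(c):
    out = []
    while c is not None:
        out.append(c)
        c = parents[c]
    return out

def ic(c):
    """Information content of a concept: -log of its corpus probability."""
    return -math.log(p[c])

def lcs(c1, c2):
    """Most informative (most specific) common subsumer of two concepts."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(common, key=ic)

def resnik(c1, c2):
    return ic(lcs(c1, c2))

def lin(c1, c2):
    return 2 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))

print(resnik("dime", "credit"), lin("dime", "nickel"))
```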