String similarity algorithms for a ticket classification system

doi:10.1109/CODIT.2019.8820497

Proceedings Article

Malgorzata Pikies, +1 more

- pp 36-41

TLDR

This work considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed, leading to a 15% improvement in the ratio of false positives to true positive classifications.

Abstract:

Fuzzy string matching allows strings that match closely, but not exactly, to be compared and extracted from bodies of text. As such, it is useful in systems which automatically extract and process documents. We summarise and compare various existing algorithms for computing string similarity measures: Longest Common Subsequence (LCS), Dice coefficient, Cosine Similarity, Levenshtein distance and Damerau distance. Based on previously classified customer support enquiries (tickets), we considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest (such as error phrases, product names and warning messages) in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed. An optimal algorithm selection is made based on novel studies of the aforementioned similarity measures on text strings tokenised into characters. This analysis also allowed an optimum similarity threshold to be identified for various categories of enquiries, reducing mismatched strings whilst retaining optimal coverage of the correctly matched key phrases. This led to a 15% improvement in the ratio of false positive to true positive classifications over the existing approach used by a customer support system.
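To make the measures named in the abstract concrete, the following is a minimal Python sketch of four of them (Levenshtein distance, LCS length, Dice coefficient and cosine similarity) applied to strings tokenised into characters, followed by a threshold-based keyword lookup. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of character bigrams for Dice, the length normalisation of the Levenshtein distance, and the 0.8 threshold are all hypothetical choices that the abstract does not specify. Damerau distance, which additionally counts transpositions of adjacent characters, is omitted for brevity.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs (assumed tokenisation for Dice)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Dice coefficient over character bigrams."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:  # both strings shorter than two characters
        return 1.0 if a == b else 0.0
    return 2 * sum((x & y).values()) / total


def cosine(a: str, b: str) -> float:
    """Cosine similarity of character-count vectors."""
    x, y = Counter(a), Counter(b)
    dot = sum(x[c] * y[c] for c in x)
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


def find_keyword(keyword: str, text: str, threshold: float = 0.8) -> list[str]:
    """Return whitespace-separated tokens whose similarity to the keyword
    meets the threshold (0.8 is an illustrative value, not from the paper)."""
    return [tok for tok in text.split()
            if levenshtein_similarity(keyword.lower(), tok.lower()) >= threshold]


if __name__ == "__main__":
    # A misspelled key phrase is still matched at this threshold.
    print(find_keyword("timeout", "Connection timout after 30s"))
```

In this sketch, raising the threshold reduces mismatched strings at the cost of missing more heavily misspelled key phrases, which is the trade-off the abstract describes tuning per enquiry category.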
