Proceedings Article

String similarity algorithms for a ticket classification system

Malgorzata Pikies, +1 more
- pp 36-41
TLDR
This work considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed, leading to a 15% improvement in the ratio of false positives to true positive classifications.
Abstract
Fuzzy string matching allows strings that match closely, but not exactly, to be compared and extracted from bodies of text. As such, it is useful in systems which automatically extract and process documents. We summarise and compare various existing algorithms for achieving string similarity measures: Longest Common Subsequence (LCS), Dice coefficient, Cosine Similarity, Levenshtein distance and Damerau distance. Based on previously classified customer support enquiries (tickets), we considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest (such as error phrases, product names and warning messages) in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed. An optimal algorithm selection is made based on novel studies of the aforementioned similarity measures on text strings tokenised into characters. Such analysis also allowed an optimum similarity threshold to be identified for various categories of enquiries, reducing mismatched strings whilst retaining optimal coverage of the correctly matched key phrases. This led to a 15% improvement in the ratio of false positives to true positive classifications over the existing approach used by a customer support system.
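
To make the idea of character-level similarity with a threshold concrete, the following is a minimal sketch, not the authors' implementation. It shows two of the measures named in the abstract (Levenshtein distance and the Dice coefficient) and a threshold-based keyword search. The function names, the normalisation of the distance by the longer string's length, the use of character bigrams for Dice, the sliding-window search and the threshold values are all assumptions made for illustration.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) over characters."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def dice_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient over character n-grams (bigrams by default)."""
    grams_a = Counter(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = Counter(b[i:i + n] for i in range(len(b) - n + 1))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0 if a == b else 0.0
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * overlap / total


def find_keyword(keyword, text, threshold=0.8, similarity=levenshtein_similarity):
    """Hypothetical helper: slide a window of len(keyword) over text and
    return (offset, window, score) for every window scoring >= threshold."""
    width = len(keyword)
    hits = []
    for start in range(max(1, len(text) - width + 1)):
        window = text[start:start + width]
        score = similarity(keyword.lower(), window.lower())
        if score >= threshold:
            hits.append((start, window, score))
    return hits
```

For example, `find_keyword("timeout", "connection timeuot error", threshold=0.7)` includes the misspelled window `"timeuot"`, because plain Levenshtein counts the swapped letters as two substitutions out of seven characters. A Damerau-style variant would count that adjacent transposition as a single edit and so score it higher, which illustrates why the choice of measure and threshold interact.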

Citations
Journal Article

Analysis and safety engineering of fuzzy string matching algorithms.

Malgorzata Pikies, +1 more
- 01 Jul 2021 - 
TL;DR: This paper complements fuzzy string matching algorithms with a second-layer Convolutional Neural Network (CNN) binary classifier, achieving an improved keyword classification ratio for two ticket categories by a relative 69% and 78%.
Book Chapter

Using String-Comparison Measures to Improve and Evaluate Collaborative Filtering Recommender Systems

TL;DR: The general idea is to model the similarity computation between users as an approximate string matching problem and to employ classical algorithms that solve it and demonstrate that the measures based on a string-comparison approach can improve accuracy.
Posted Content

Novel Keyword Extraction and Language Detection Approaches

TL;DR: This paper proposes a fast novel approach to string tokenisation for fuzzy language matching and experimentally demonstrates an 83.6% decrease in processing time with an estimated improvement in recall at the cost of a 2.6% decrease in precision.

The automated machine learning classification approach on telco trouble ticket dataset

TL;DR: In this article, the authors presented automated machine learning for solving a practical problem of a telco trouble ticket system; in particular, the focus is on the classification of early resolution code from the trouble ticket dataset.
References
Book

Introduction to Information Retrieval

TL;DR: In this article, the authors present an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents; methods for evaluating systems; and an introduction to the use of machine learning methods on text collections.
Journal Article

A vector space model for automatic indexing

TL;DR: An approach based on space density computations is used to choose an optimum indexing vocabulary for a collection of documents, demonstrating the usefulness of the model.
Journal Article

The String-to-String Correction Problem

TL;DR: An algorithm is presented which solves the string-to-string correction problem in time proportional to the product of the lengths of the two strings.

Domain names - concepts and facilities

TL;DR: This memo describes the domain style names and their use for host address look up and electronic mail forwarding and discusses the clients and servers in the domain name system and the protocol used between them.
Journal Article

Approximate string-matching with q -grams and maximal matches

TL;DR: Two string distance functions that are computable in linear time give a lower bound for the edit distance (in the unit cost model), which leads to fast hybrid algorithms for edit distance based string matching.