Showing papers on "Approximate string matching" published in 2018

Journal Article•DOI•

Toponym matching through deep neural networks

Rui Santos¹, Patricia Murrieta-Flores², Pável Calado¹, Bruno Martins¹•Institutions (2)

Instituto Superior Técnico¹, University of Chester²

01 Feb 2018-International Journal of Geographical Information Science

TL;DR: This article presents a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or nonmatching, and shows that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.

Abstract: Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state-of-the-art relies on string similar...

61 citations

Journal Article•DOI•

Fuzzy Matching Based on Gray-scale Difference for Quantum Images

Gaofeng Luo¹,², Ri-Gui Zhou¹, XingAo Liu¹, WenWen Hu¹, Jia Luo¹•Institutions (2)

Shanghai Maritime University¹, Shaoyang University²

08 May 2018-International Journal of Theoretical Physics

TL;DR: A fuzzy quantum image matching scheme based on gray-scale difference is proposed to find out the target region in a reference image, which is very similar to the template image and enables exponentially significant speedup via quantum parallel computation.

Abstract: Quantum image processing has recently emerged as an essential problem in practical tasks, e.g. real-time image matching. Previous studies have shown that quantum superposition and entanglement can greatly improve the efficiency of complex image processing. In this paper, a fuzzy quantum image matching scheme based on gray-scale difference is proposed to find the target region in a reference image that is very similar to the template image. Firstly, we employ the novel enhanced quantum representation (NEQR) to store digital images. Then certain quantum operations are used to evaluate the gray-scale difference between two quantum images by thresholding. If none of the obtained gray-scale differences exceeds the threshold value, the fuzzy matching of the quantum images is successful. Theoretical analysis and experiments show that the proposed scheme performs fuzzy matching at a low cost and also enables exponentially significant speedup via quantum parallel computation.

17 citations

Book Chapter•DOI•

Machine Learning and Fuzzy Logic Techniques for Personalized Tutoring of Foreign Languages

Christos Troussas¹, Konstantina Chrysafiadi¹, Maria Virvou¹•Institutions (1)

University of Piraeus¹

27 Jun 2018

TL;DR: An intelligent tutoring system for learning English and French concepts is presented that incorporates a novel model for error diagnosis using machine learning and employs two algorithmic techniques.

Abstract: Intelligent computer-assisted language learning employs artificial intelligence techniques to create a more personalized and adaptive environment for language learning. To this end, this paper presents an intelligent tutoring system for learning English and French concepts. The system incorporates a novel model for error diagnosis using machine learning. This model employs two algorithmic techniques, namely Approximate String Matching and String Meaning Similarity, to diagnose spelling mistakes, mistakes in the use of tenses, mistakes in the use of auxiliary verbs and mistakes originating from confusion in the simultaneous tutoring of languages. The model for error diagnosis is used by the fuzzy logic model, which takes as input the results of the first or the knowledge dependencies existing among the different domain concepts of the learning material and decides dynamically about the learning content that is suitable to be delivered to the learner each time.

15 citations

Journal Article•DOI•

New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance

ThienLuan Ho¹, Seung-Rohk Oh¹, HyunJin Kim¹•Institutions (1)

Dankook University¹

01 May 2018-The Journal of Supercomputing

TL;DR: New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance are proposed and several methods to adopt CVM and PCVM algorithms in case the input pattern is in circular structure are shown.

Abstract: This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Fixed-length approximate string matching and approximate circular string matching are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k-mismatches. The CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate the CVM algorithm in parallel. The PCVM algorithm integrates two levels of parallelism, exploiting not only word-level parallelism but also data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of GPUs, a shared-memory parallel counter-vector-mismatches (PCVMsmem) scheme can be derived from the PCVM algorithm. The PCVMsmem scheme can exploit the memory model of GPUs to optimize the performance of the PCVM algorithm. Finally, this paper shows several methods to adapt the CVM and PCVM algorithms to the case where the input pattern has a circular structure. In experiments with real DNA packages, the proposed algorithms and scheme run much faster than previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.
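
For orientation, the k-mismatches problem described in this abstract can be sketched in a few lines of Python. This is a naive baseline under stated assumptions, not the CVM or PCVM algorithm: it counts mismatches character by character rather than summing packed counters in a machine word, and the function names `k_mismatch_positions` and `circular_k_mismatch_positions` are hypothetical.

```python
def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return start indices where pattern occurs in text with at most k mismatches.

    Naive O(n*m) scan under the Hamming distance; illustrative only.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        if mismatches <= k:
            hits.append(start)
    return hits


def circular_k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Same search, but any rotation of pattern counts as the pattern."""
    rotations = {pattern[i:] + pattern[:i] for i in range(len(pattern))}
    hits = set()
    for rotation in rotations:
        hits.update(k_mismatch_positions(text, rotation, k))
    return sorted(hits)


print(k_mismatch_positions("ACGTACGA", "ACGT", 1))
print(circular_k_mismatch_positions("ACGTACGA", "GTAC", 0))
```

Trying every rotation multiplies the work by the pattern length, which is one reason dedicated circular-matching algorithms are of interest.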

11 citations

Proceedings Article•DOI•

Classification of URL bitstreams using bag of bytes

Keiichi Shima¹, Daisuke Miyamoto², Hiroshi Abe¹, Tomohiro Ishihara³, Kazuya Okada³, Yuji Sekiya³, Hirochika Asai, Yusuke Dois¹•Institutions (3)

Internet Initiative Japan¹, National Archives and Records Administration², University of Tokyo³

01 Feb 2018

TL;DR: This paper applies a mechanical approach to generate feature vectors from URL strings and can build a flexible filtering decision module by continually teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain.

Abstract: Protecting users from accessing malicious web sites is one of the important management tasks for network operators. There are many open-source and commercial products to control which web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but not scalable, though there are some enhanced approaches utilizing fuzzy matching technologies. Other approaches try to use machine learning (ML) techniques by extracting features from URL strings. This approach can cover a wider area of Internet web sites, but finding good features requires deep knowledge of trends in web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach helps to extract features automatically by investigating a large amount of existing sample data. Using this technique, we can build a flexible filtering decision module by continually teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain. In this paper, we apply a mechanical approach to generate feature vectors from URL strings. We implemented our approach and tested it with realistic URL access history data taken from a research organization and data from the well-known archive of phishing site information, PhishTank.com. Our approach achieved 2 to 3% better accuracy compared to the existing DL-based approach.
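
As a rough illustration of what a mechanically generated feature vector can look like, the sketch below builds a normalized byte histogram of a URL. It is an assumption made for illustration: the abstract does not specify the encoding, so this is not necessarily the paper's bag-of-bytes construction, and the function name `bag_of_bytes` is hypothetical.

```python
def bag_of_bytes(url: str) -> list[float]:
    """Map a URL to a 256-dimensional vector of relative byte frequencies.

    Order of bytes is discarded, as in any bag-of-features representation.
    """
    data = url.encode("utf-8")
    counts = [0] * 256
    for byte in data:
        counts[byte] += 1
    total = len(data) or 1  # avoid division by zero for the empty string
    return [count / total for count in counts]


vector = bag_of_bytes("http://a.example/")
print(len(vector), vector[ord("/")])
```

A vector like this needs no hand-crafted URL features, which is the property the abstract emphasizes; a classifier would then be trained on such vectors.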

8 citations

Journal Article•DOI•

Fuzzy Matching of OpenAPI Described REST Services

Cong Peng¹, Prashant Goswami¹, Guohua Bai¹•Institutions (1)

Blekinge Institute of Technology¹

01 Jan 2018

TL;DR: The syntactic service matching methods based on the classical set theory have a role in discovering desired services for composition and orchestration in the vast amount of Web services.

Abstract: The vast amount of Web services brings the problem of discovering desired services for composition and orchestration. The syntactic service matching methods based on the classical set theory have a

5 citations

Proceedings Article•DOI•

ReneGENE-DP: Accelerated Parallel Dynamic Programming for Genome Informatics

Santhi Natarajan¹, N. KrishnaKumar¹, Mysore S. Pavan², Debnath Pal¹, S. K. Nandy¹•Institutions (2)

Indian Institute of Science¹, National Institute of Technology, Karnataka²

01 Mar 2018

TL;DR: ReneGENE-DP, implementations of the DP computations on hardware accelerators, with the novelty of realizing traceback in hardware in parallel with the forward scan during analysis, on both FPGA and GPU.

Abstract: Parsing a very long genomic string (human genome is typically 3 billion characters long) abstracts the whole complexity of biocomputing. Approximate String Matching (ASM) is the most eligible computing paradigm that captures the biological complexity of the genome, integrating various sources of biological information into tractable probabilistic models. Though computationally complex, the Dynamic Programming (DP) methodology proves to be very efficient for ASM, in discriminating substantial similarities amongst severe noise in genetic data presented by evolution. Though a significant amount of computations in the DP algorithms are accelerated on multiple platforms, the less complex traceback step is still performed in the host, presenting significant memory and Input/Output bottleneck. With billions of such alignments required to analyse the genomic big data from the Next Generation Sequencing (NGS) Platforms, this bottleneck can severely affect system performance. This paper presents ReneGENE-DP, our implementations of the DP computations on hardware accelerators, with the novelty of realizing traceback in hardware in parallel with the forward scan during analysis, on both FPGA and GPU. The fastest FPGA implementation is around 43.63x better than the fastest GPU implementation of ReneGENE-DP, which in turn, is 380.85x faster than the reference design, which is a GPU based DP algorithm with traceback on host.
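
The two phases this abstract refers to, a forward scan that fills the DP table and a traceback that recovers the alignment, can be sketched as follows. This is a plain software illustration using unit-cost edit distance, assumed here for simplicity; it is not the ReneGENE-DP design, which performs traceback in hardware in parallel with the forward scan. The function name `align` is hypothetical.

```python
def align(a: str, b: str) -> tuple[int, str, str]:
    """Global alignment by unit-cost edit distance with traceback.

    Returns (distance, aligned_a, aligned_b) where '-' marks a gap.
    """
    n, m = len(a), len(b)
    # Forward scan: dp[i][j] is the edit distance between a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # deletion from a
                           dp[i][j - 1] + 1)         # insertion into a
    # Traceback: walk from the bottom-right cell back to the origin.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))


distance, row_a, row_b = align("kitten", "sitting")
print(distance)
print(row_a)
print(row_b)
```

Note that the traceback needs the whole table (or equivalent pointers) to be available, which is why moving it off the host matters for the memory and I/O bottleneck the abstract describes.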

5 citations

Proceedings Article•DOI•

Achieving Secure and Effective Search Services in Cloud Computing

Qin Liu¹, Shuyu Pei², Kang Xie, Jie Wu³, Tao Peng⁴, Guojun Wang⁴•Institutions (4)

Beijing University of Posts and Telecommunications¹, Hunan University², Temple University³, Guangzhou University⁴

01 Aug 2018

TL;DR: This paper proposes a wildcard-based multi-keyword fuzzy search (WMFS) scheme over the encrypted data, which tolerates keyword misspellings by exploiting the indecomposable property of primes.

Abstract: One critical challenge of today's cloud services is how to provide an effective search service while preserving user privacy. In this paper, we propose a wildcard-based multi-keyword fuzzy search (WMFS) scheme over the encrypted data, which tolerates keyword misspellings by exploiting the indecomposable property of primes. Compared with existing secure fuzzy search schemes, our WMFS scheme has the following merits: 1) Efficiency. It eliminates the requirement of a predefined dictionary and thus supports updates efficiently. 2) High accuracy. It eliminates the false positives and false negatives introduced by specific data structures and thus allows the user to retrieve files as accurately as possible. 3) Flexibility. It gives the user great flexibility to specify different search patterns including keyword and substring matching. Extensive experiments on a real data set demonstrate the effectiveness and efficiency of our scheme.

5 citations

Patent•

Hybrid approach to approximate string matching using machine learning

Singh Pranjal¹, Banerjee Soumyajyoti•Institutions (1)

Visa Inc.¹

25 Oct 2018

TL;DR: In this article, a system can identify a corresponding string stored in memory based on an incomplete input string by analyzing phonetic and distance metrics for a plurality of strings stored in the memory.

Abstract: Systems, apparatuses, and methods are provided for identifying a corresponding string stored in memory based on an incomplete input string. A system can analyze and produce phonetic and distance metrics for a plurality of strings stored in memory by comparing the plurality of strings to an incomplete input string. These similarity metrics can be used as the input to a machine learning model, which can quickly and accurately provide a classification. This classification can be used to identify a string stored in memory that corresponds to the incomplete input string.

5 citations

Posted Content•

Squeezing More Out of Your Data: Business Record Linkage with Python

John Cuffe, Nathan Goldschlag

01 Jan 2018-Research Papers in Economics

TL;DR: The Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata, which leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches.

Abstract: Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.

4 citations

Journal Article•DOI•

Parallel String Matching with Linear Array, Butterfly and Divide and Conquer Models

S. Viswanadha Raju, K.S. Reddy, Chinta Someswara Rao

01 Jun 2018-Annals of Data Science

TL;DR: The experimental results have shown that the proposed string matching algorithms perform very well compared to the brute-force, KMP and Boyer-Moore string matching algorithms.

Abstract: String matching is a technique for searching for a pattern in a text. It is the basic concept for extracting useful information from large volumes of text, and is used in applications such as text processing, information retrieval, text mining, pattern recognition, DNA sequencing and data cleaning. Although some simple mechanisms are said to perform very well in practice, plenty of research has been published on the subject, research is still active in this area, and there are ample opportunities to develop new techniques. For this purpose, this paper proposes linear array based string matching, string matching with a butterfly model and string matching with divide and conquer models for sequential and parallel environments. To assess the efficiency of the proposed models, genome sequences of different sizes (10–100 Mb) are taken as the input data set. The experimental results have shown that the proposed string matching algorithms perform very well compared to the brute-force, KMP and Boyer-Moore string matching algorithms.

Posted Content•

Sequence Searching Allowing for Non-Overlapping Adjacent Unbalanced Translocations

Domenico Cantone¹, Simone Faro², Arianna Pavone³•Institutions (3)

University of Catania¹, Seoul National University², University of Messina³

02 Dec 2018-arXiv: Data Structures and Algorithms

TL;DR: This paper presents the first solution in the literature for the approximate string matching problem allowing for unbalanced translocations of factors, and shows that under the assumptions of equiprobability and independence of characters the algorithm has an $O(n\log^2_{\sigma} m)$ average time complexity.

Abstract: Unbalanced translocations are among the most frequent chromosomal alterations, accounting for 30% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we investigate the approximate string matching problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. In particular, we first present an $O(nm^3)$-time and $O(m^2)$-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has an $O(n\log^2_{\sigma} m)$ average time complexity, for an alphabet of size $\sigma$, still maintaining the $O(nm^3)$-time and the $O(m^2)$-space complexity in the worst case. To the best of our knowledge this is the first solution in the literature for the approximate string matching problem allowing for unbalanced translocations of factors.

Journal Article•DOI•

Information Extraction Framework to Build Legislation Network

Neda Sakhaee¹, Mark C. Wilson¹•Institutions (1)

University of Auckland¹

04 Dec 2018-arXiv: Information Retrieval

TL;DR: The study highlights the importance of data accuracy in network analysis and improves approximate string matching techniques to produce reliable network data-sets with more than 98% precision and recall.

Abstract: This paper concerns an Information Extraction process for building a dynamic Legislation Network from legal documents. Unlike supervised learning approaches which require additional calculations, the idea here is to apply Information Extraction methodologies by identifying distinct expressions in legal text and extracting quality network information. The study highlights the importance of data accuracy in network analysis and improves approximate string matching techniques for producing reliable network data-sets with more than 98 percent precision and recall. The values, applications, and the complexity of the created dynamic Legislation Network are also discussed and challenged.

Posted Content•DOI•

Optimum Search Schemes for Approximate String Matching Using Bidirectional FM-Index

Kiavash Kianfar¹⁻⁶, Christopher Pockrandt⁷, Bahman Torkamandi¹⁻⁶, Haochen Luo¹⁻⁶, Knut Reinert⁷•Institutions (7)

Texas Tech University¹, Lanzhou University², Texas A&M University³, Texas A&M Health Science Center College of Medicine⁴, Hospital Corporation of America⁵, Texas College⁶, Free University of Berlin⁷

13 Apr 2018-bioRxiv

TL;DR: In this paper, a mixed integer program (MIP) is proposed to solve the optimization problem for Hamming distance with given number of pieces, which significantly improves the performance of search in bidirectional FM-index.

Abstract: Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.

Patent•

Method and device for character string matching

Liu Xinran, Li Xiaoyu, Wang Wenbo, Xu Jiarui, Li Ming, Zhou Yu

05 Jan 2018

TL;DR: In this paper, a method and device for character string matching with fuzzy nodes is described, where the AC state machine generates regular nodes based on the non-wildcard relationship between the characters contained in each rule character string.

Abstract: The invention discloses a method and device for character string matching. The method includes the following steps: an AC state machine with fuzzy nodes is initialized, wherein the AC state machine generates regular nodes based on the non-wildcard relationship between the characters contained in each rule character string, and generates the corresponding fuzzy nodes according to the wildcard relationship between characters; target character strings are entered into the AC state machine, each character in the target character strings is compared with the corresponding character of each node in the AC state machine, one or more rule character strings matching the target strings are determined, and a corresponding operation is performed according to the matched rule character strings. According to the technical scheme, after the target character strings are obtained, they are entered into the AC state machine for matching, and the one or more rule character strings matched in the target character strings are determined. Multi-segment fuzzy matching is thus achieved, the flexibility of definition offered by the rule character strings is guaranteed, and the application demand is satisfied.

Journal Article•DOI•

Fuzzy String Matching with a Deep Neural Network

Daniel Shapiro¹, Nathalie Japkowicz¹, Mathieu Lemay, Miodrag Bolic¹•Institutions (1)

University of Ottawa¹

19 Mar 2018-Applied Artificial Intelligence

TL;DR: A deep learning neural network for character-level text classification of noisy text spots keywords in the text output of an optical character recognition system using memoization and by encoding the text into feature vectors related to letter frequency.

Abstract: A deep learning neural network for character-level text classification is described in this work. The system spots keywords in the text output of an optical character recognition system using memoi...

Journal Article•DOI•

A review on parameterized string matching algorithms

Rama Singh¹, Deepak Rai¹, Rajesh Prasad²•Institutions (2)

Ajay Kumar Garg Engineering College¹, American University of Nigeria²

02 Jan 2018-Journal of Information and Optimization Sciences

TL;DR: This paper reviews single-pattern parameterized pattern matching algorithms that use q-grams and have linear time complexity, and discusses which algorithm produces the fewest false matches.

Abstract: Identification of candidate genes and nucleotides are basic uses of the Bioinformatics research. Biology that deals with molecule has functional as well as structural behavior imparting need of wel...

Proceedings Article•DOI•

Research on Multi-Spatial Keyword Fuzzy Query Algorithm in Spatial Data

Suzhi Zhang¹, Rui Yang¹, Yanan Zhao¹•Institutions (1)

Zhengzhou University of Light Industry¹

25 Aug 2018

TL;DR: A multi-space keyword fuzzy query algorithm is proposed that converts two-dimensional distance calculation to Morton code matching to improve query efficiency, and uses a fuzzy matching algorithm to support query fault tolerance.

Abstract: With the large-scale use of smart devices with positioning capabilities, more and more spatial data is generated and each piece of data contains more and more information. However, past spatial keyword query algorithms support only single-keyword queries, which cannot meet users' more personalized needs. Therefore, this paper proposes a multi-space keyword fuzzy query algorithm. In this algorithm, the two-dimensional spatial distance calculation used previously is converted to Morton code matching to improve query efficiency, and a fuzzy matching algorithm is used to support query fault tolerance. Experimental results show that the proposed algorithm improves query efficiency and user satisfaction.
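
The Morton (Z-order) code mentioned in this abstract interleaves the bits of two coordinates into a single integer, so that nearby cells tend to share long code prefixes. The sketch below shows the standard bit interleaving only; how the paper combines it with keyword matching is not described in the abstract, and the names `morton_encode` and `morton_decode` and the 16-bit default are assumptions.

```python
def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y (x in even positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code: int, bits: int = 16) -> tuple[int, int]:
    """Inverse of morton_encode: split even and odd bit positions."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y


print(morton_encode(3, 5))
print(morton_decode(morton_encode(3, 5)))
```

Because a grid cell becomes one integer, a spatial lookup can be served by ordinary one-dimensional index structures instead of pairwise distance calculations.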

Patent•

Fuzzy matching method for equipment model name

Tian Zhenxing, Huang Guilan, Shi Muzhi, Yang Yujing, Zhang Xiaomin, Qian Jinxing, Dai Jie

01 May 2018

TL;DR: In this paper, a fuzzy matching method for equipment model names is proposed. Targeting the particularities of equipment names, the method calculates a similarity distance between the character strings of two equipment names through an improved Jaro-Winkler algorithm, and sets a threshold value to judge whether the two names express the same equipment model.

Abstract: The invention discloses a fuzzy matching method for an equipment model name. The method targets the particularities of equipment names: a similarity distance between the character strings of two equipment names is calculated through an improved Jaro-Winkler algorithm, and a threshold value is set to judge whether the character strings of the two equipment names express the same equipment model. With this method, differences between Chinese characters and Pinyin initials are masked, the influence of numeric model marks on model name matching is increased, and whether the character strings of the equipment names match can be judged accurately.
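
For reference, the standard Jaro-Winkler similarity with a decision threshold can be sketched as follows. This is the textbook measure, not the patent's improved variant: the handling of Chinese characters, Pinyin initials and numeric marks is not reproduced, and the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def same_model(name_a: str, name_b: str, threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: names match if similarity reaches threshold."""
    return jaro_winkler(name_a, name_b) >= threshold


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost makes the measure favor strings that agree at the start, so a domain-specific variant would plausibly adjust which positions or character classes carry the most weight.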

Journal Article•DOI•

Data preparation and fuzzy matching techniques for improved statistical modeling

Stephen Sloan, Kirk Paul Lafler

01 Jan 2018-Model Assisted Statistics and Applications

Book Chapter•DOI•

Integrating Approximate String Matching with Phonetic String Similarity

Junior Ferri¹, Hegler Tissot¹, Marcos Didonet Del Fabro¹•Institutions (1)

Federal University of Paraná¹

02 Sep 2018

TL;DR: This work presents an approach that transforms the dictionary and each input token into a compact well-known phonetic representation and applies a second similarity measure to filter the best result to annotate a given entity.

Abstract: Well-defined dictionaries of tagged entities are used in many tasks to identify entities where the scope is limited and there is no need to use machine learning. One common solution is to encode the input dictionary into Trie trees to find matches on an input text. However, the size of the dictionary and the presence of spelling errors on the input tokens have a negative influence on such solutions. We present an approach that transforms the dictionary and each input token into a compact well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to filter a set of initial results. Lastly, we apply a second similarity measure to filter the best result to annotate a given entity. The experiments showed that it achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
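
The two-stage idea in this abstract, a phonetic key to narrow the candidates followed by a second similarity measure to pick the best one, can be sketched as below. Several choices here are assumptions: the abstract does not name the phonetic representation, so American Soundex is used as a stand-in; a plain Python dict replaces the Trie; the lookup by key is exact rather than inexact; and `difflib.SequenceMatcher` serves as the second measure.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Optional

_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(word: str) -> str:
    """American Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # vowels reset prev, so repeated codes across vowels count
    return ("".join(out) + "000")[:4]


def build_index(entries: list[str]) -> dict[str, list[str]]:
    """Group dictionary entries by phonetic key (dict used in place of a Trie)."""
    index = defaultdict(list)
    for entry in entries:
        index[soundex(entry)].append(entry)
    return dict(index)


def lookup(index: dict[str, list[str]], token: str) -> Optional[str]:
    """Filter by phonetic key, then keep the candidate most similar to token."""
    candidates = index.get(soundex(token), [])
    if not candidates:
        return None
    return max(candidates,
               key=lambda c: SequenceMatcher(None, token.lower(), c.lower()).ratio())


index = build_index(["Robert", "Rupert", "Ashcraft"])
print(soundex("Robert"), lookup(index, "Robrt"))
```

Since many spellings collapse onto one key, the phonetic index has fewer distinct entries than the raw dictionary, which is consistent with the size reduction the abstract reports for its phonetic Trie.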

Book Chapter•DOI•

Fast Plagiarism Detection Using Approximate String Matching and Vector Representation of Words

Kensuke Baba¹•Institutions (1)

Fujitsu¹

01 Jan 2018

TL;DR: The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.

Abstract: Plagiarism detection for a huge amount of document data requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representation of words, and a speed improvement to an implementation of the algorithm. The effect of the improvement on the algorithm is evaluated by conducting experiments with a dataset. The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.

Patent•

Fuzzy matching for computing resources

Xia Yu

08 Nov 2018

TL;DR: In this paper, a comparison length based on lengths of strings in a set of strings is used to normalize strings to identify computing resources and improve performance and utilization of computing resources.

Abstract: Systems and methods are disclosed for normalizing strings to identify computing resources and improve performance and utilization of computing resources. For example, methods may include determining a comparison length based on lengths of strings in a set of strings; padding a first string from the set of strings to the comparison length to obtain a padded string; receiving a second string; determining a distance between the second string and the padded string; and identifying a match between the first string and the second string based on the distance.
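
The steps listed in this abstract (derive a comparison length, pad, measure a distance, decide on a match) can be sketched as follows. The specifics are assumptions, since the abstract leaves them open: the comparison length is taken as the longest string in the set, the pad character is NUL, the distance is a position-wise mismatch count, and all function names are hypothetical.

```python
from typing import Optional

PAD = "\0"  # assumed pad character that does not occur in real names


def comparison_length(strings: list[str]) -> int:
    """Assumed rule: use the length of the longest string in the set."""
    return max((len(s) for s in strings), default=0)


def pad(s: str, length: int) -> str:
    """Right-pad s with PAD up to length (longer strings are left unchanged)."""
    return s.ljust(length, PAD)


def distance(a: str, b: str) -> int:
    """Position-wise mismatch count after padding both strings to equal width."""
    width = max(len(a), len(b))
    return sum(x != y for x, y in zip(pad(a, width), pad(b, width)))


def best_match(candidates: list[str], query: str,
               max_distance: int) -> Optional[tuple[str, int]]:
    """Return the closest candidate within max_distance, or None."""
    length = comparison_length(candidates)
    best = None
    for candidate in candidates:
        d = distance(query, pad(candidate, length))
        if d <= max_distance and (best is None or d < best[1]):
            best = (candidate, d)
    return best


print(best_match(["web-01", "web-02", "db-1"], "web-o1", max_distance=2))
```

Padding every string to one common length keeps the distances comparable across candidates of different lengths, which is the normalization step the abstract highlights.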

Patent•

Drawing similarity comparison system

Xu Bo, Jiang Jing, Li Kunjian, Pan Feng, Zhu Jian

25 May 2018

TL;DR: In this paper, a drawing similarity comparison system is presented, which comprises a building module for obtaining feature library pictures and pictures to be retrieved and obtaining a low frequency component through calculation of Fourier transform.

Abstract: The invention discloses a drawing similarity comparison system. The system comprises a building module for obtaining feature library pictures and pictures to be retrieved and obtaining a low frequency component through calculation of Fourier transform, a unification processing module for enabling the obtained low frequency component to generate corresponding feature indexing character string data, and a data processor for importing the feature indexing character string data of the feature library pictures into a database and encoding the feature library pictures and providing matched samples for retrieval. Feature character strings are obtained through Fourier transform, only front-to-back comparison is needed for character string similarity comparison, and the influence of the rear character on the similarity is only one half that of the previous character; keywords in the database can be directly adopted for fuzzy search, the retrieval efficiency is high, and the calculated quantity is small.
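
The front-to-back comparison with halving influence can be read as a positional weighting in which character i carries weight 1/2^i. The sketch below is one possible interpretation, not the patented procedure: the normalisation by total weight and the function name `weighted_similarity` are assumptions, and the Fourier-based generation of the feature strings is not covered.

```python
def weighted_similarity(a, b):
    """Position-weighted similarity in [0, 1]; each character counts half as
    much as the one before it."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0
    score = 0.0
    total = 0.0
    weight = 1.0
    for i in range(n):
        total += weight
        # Positions beyond the end of the shorter string count as mismatches.
        if i < len(a) and i < len(b) and a[i] == b[i]:
            score += weight
        weight /= 2
    return score / total


print(weighted_similarity("abcd", "abcx"))  # mismatch at the last position
print(weighted_similarity("abcd", "xbcd"))  # mismatch at the first position
```

Under this weighting a mismatch in the first character costs more than mismatches in all later characters combined, which matches the idea that leading characters encode the most significant low-frequency information.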

Book Chapter•DOI•

Obtaining Precision-Recall Trade-Offs in Fuzzy Searches of Large Email Corpora

Kyle Porter¹, Slobodan Petrovic¹•Institutions (1)

Norwegian University of Science and Technology¹

03 Jan 2018

TL;DR: This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches.

Abstract: Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation while recall becomes more valuable in the later stages.
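
To make the idea of constraining edit operations by type concrete, here is a minimal sketch that checks whether a candidate word can be produced from a keyword within separate budgets for insertions, deletions and substitutions. It is not the cedas algorithm: it compares whole words with a memoised recursion, and it treats the budgets as upper bounds rather than exact counts. The name `within_constraints` and its parameters are hypothetical.

```python
from functools import lru_cache


def within_constraints(keyword, candidate, max_ins=0, max_del=0, max_sub=0):
    """True if candidate is reachable from keyword within the given budgets."""

    @lru_cache(maxsize=None)
    def reach(i, j, ins, dele, sub):
        # i and j are positions in keyword and candidate; the rest are budgets left.
        if i == len(keyword) and j == len(candidate):
            return True
        if i < len(keyword) and j < len(candidate):
            if keyword[i] == candidate[j]:
                if reach(i + 1, j + 1, ins, dele, sub):      # plain match
                    return True
            elif sub > 0 and reach(i + 1, j + 1, ins, dele, sub - 1):
                return True                                   # substitution
        if j < len(candidate) and ins > 0 and reach(i, j + 1, ins - 1, dele, sub):
            return True                                       # insertion
        if i < len(keyword) and dele > 0 and reach(i + 1, j, ins, dele - 1, sub):
            return True                                       # deletion
        return False

    return reach(0, 0, max_ins, max_del, max_sub)


print(within_constraints("enron", "enrom", max_sub=1))
print(within_constraints("enron", "enronn", max_sub=1))
```

Allowing only substitutions, for example, rejects candidates of a different length, which is one way tighter constraints can raise precision at the cost of recall.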

Patent•

Sensor node identifier platform-based information management and analysis method

Yang Weishuai, Luo Zhiyong, Shen Xun, Cai Ting, Zhang Rou, Ji Liangyuan, Zhang Xue

20 Jul 2018

TL;DR: In this paper, a sensor node identifier platform-based information management and analysis method is proposed to deal with the data information construction requirements of an industrial sensing network, where the sensor information is effectively managed and analyzed.

Abstract: The invention discloses a sensor node identifier platform-based information management and analysis method. The method comprises the following steps of 1) extracting sensor information data; 2) performing database construction on the extracted sensor information data; 3) matching the sensor information data with sensor node identifiers by applying a fuzzy matching algorithm; and 4) performing correlation analysis on sensor information by applying a data correlation analysis algorithm. The sensor information is effectively managed and analyzed, so that the data information construction requirements of an industrial sensing network can be met.

Patent•

Detect Duplicates with Exact and Fuzzy Matching on Encrypted Match Indexes

Alexandre Hersans¹, Swaroop Shere¹, Chenghung Ker¹, Parth Vaishnav¹, Assaf Ben-Gur¹, Victor Liu¹, Daniel McGarry¹, Samatha Sanikommu¹•Institutions (1)

Salesforce.com¹

03 Jul 2018

TL;DR: In this article, a system, method, and computer program product embodiment for detecting duplicates with exact and fuzzy matching on encrypted match indexes using an encryption key in a cloud computing platform is presented.

Abstract: Disclosed herein are system, method, and computer program product embodiments for detecting duplicates with exact and fuzzy matching on encrypted match indexes using an encryption key in a cloud computing platform. An embodiment operates by determining a match rule index value upon reception of a new record. The embodiment encrypts the match rule index value using the customer's encryption key and a deterministic encryption method and stores the encrypted match rule index value. Duplicate detection may be later performed by using the same deterministic encryption method to determine a ciphertext for a candidate entry and comparing the ciphertext to the stored encrypted match indexes.
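
The core property used here is that a deterministic keyed transformation maps equal match index values to equal outputs, so duplicates can be found by comparing protected values only. The sketch below illustrates that property with HMAC-SHA256 from the Python standard library, which is a keyed hash and not an encryption scheme; it is a stand-in chosen to keep the example self-contained, not the method of the patent. The field names, the normalisation in `match_index_value` and the `DuplicateIndex` class are all hypothetical.

```python
import hashlib
import hmac


def match_index_value(record):
    """Hypothetical match rule: case-folded, whitespace-normalised name and email."""
    parts = (" ".join(str(record.get(field, "")).split()).casefold()
             for field in ("name", "email"))
    return "|".join(parts)


def protect(value, key):
    """Deterministic keyed token (HMAC-SHA256 stand-in for deterministic encryption)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


class DuplicateIndex:
    """Stores only protected match index values, never the plaintext."""

    def __init__(self, key):
        self._key = key
        self._tokens = set()

    def add(self, record):
        self._tokens.add(protect(match_index_value(record), self._key))

    def is_duplicate(self, record):
        return protect(match_index_value(record), self._key) in self._tokens


index = DuplicateIndex(key=b"customer-specific-secret")
index.add({"name": "Ada  Lovelace", "email": "ADA@example.com"})
print(index.is_duplicate({"name": "ada lovelace", "email": "ada@example.com"}))
```

Any fuzziness has to be built into the match index value before it is protected (here, case and whitespace normalisation), because the protected values themselves can only be compared for equality.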

Patent•

Power grid data association method based on address matching technology

Wang Zongwei, Chen Peng, Sheng Yan, Jin Peng, Li Yanyan, Bu Xiaoyang, Zhao Guoyi, Zhang Quan, Liu Kunpeng, Gong Lihua, Yang Jing

17 Aug 2018

TL;DR: In this paper, a power grid data association method based on an address matching technology is proposed, which achieves the aim that repair and complaint work orders without client numbers are associated to basic file information of clients through an address fuzzy matching technology.

Abstract: The invention relates to a power grid data association method based on an address matching technology. The method achieves the aim that repair and complaint work orders without client numbers are associated to basic file information of clients through an address fuzzy matching technology. The power grid data association method comprises the following steps that pre-processing is conducted; repair address information of the clients is received and stored as text information; structured address information, special characters and Arabic numbers are deleted; Chinese address character strings are subjected to editing distance calculation; a Chinese address character string of a user corresponding to a minimum calculation result is determined as a repair address of the user. Compared with the prior art, the power grid data association method has the advantages that a mode for achieving structured data association by using non-structured data association is provided, and fusion analysis of trans-disciplinary data is achieved; an address matching degree is calculated based on a minimum editing distance algorithm, adoption of Chinese word segmentation is avoided, and the misjudgment probability is lowered.
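
The matching steps in this abstract (strip special characters and digits, compute edit distances, pick the minimum) can be sketched as below. This is an illustration under assumptions rather than the patented method: the removal of structured address information is omitted, the sample addresses are invented English strings, and `clean`, `levenshtein` and `closest_address` are hypothetical names. The character-level distance applies equally to Chinese strings, since Python strings are sequences of Unicode characters.

```python
import re


def clean(address):
    """Drop whitespace, punctuation, underscores and digits from an address."""
    return re.sub(r"[\W\d_]+", "", address)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def closest_address(repair_address, customer_addresses):
    """Return the customer address with the smallest edit distance."""
    target = clean(repair_address)
    return min(customer_addresses,
               key=lambda addr: levenshtein(target, clean(addr)))


customers = ["Elm Street 12, Unit 4", "Oak Avenue 7"]
print(closest_address("Elm Stret #12", customers))
```

Working at the character level is what lets this approach skip word segmentation, as the abstract notes for Chinese addresses.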

Journal Article•DOI•

Preparation of Improved Turkish DataSet for Sentiment Analysis in Social Media

Semiha Makinist, Ibrahim Riza Hallac¹, Betul Ay Karakus¹, Galip Aydin¹•Institutions (1)

Fırat University¹

30 Jan 2018-arXiv: Computation and Language

TL;DR: In this paper, the authors present an improved Turkish dataset together with an effective Hadoop-based spelling correction algorithm; the dataset can be used as an open source dataset in sentiment analysis studies, and the algorithm is reported to detect and correct spelling errors successfully.

Abstract: A public dataset with a variety of properties suitable for sentiment analysis [1], event prediction, trend detection and other text mining applications is needed in order to perform analysis studies successfully. The vast majority of data on social media is text-based, and it is not possible to apply machine learning processes directly to these raw data, since several different processes are required to prepare the data before the algorithms are applied. For example, different misspellings of the same word enlarge the word vector space unnecessarily, which reduces the success of the algorithm and increases the computational power required. This paper presents an improved Turkish dataset with an effective spelling correction algorithm based on Hadoop [2]. The collected data is recorded on the Hadoop Distributed File System and the text-based data is processed with the MapReduce programming model. This method is suitable for the storage and processing of large text-based social media data. In this study, movie reviews have been automatically recorded with Apache ManifoldCF (MCF) [3] and data clusters have been created. Various methods, such as Levenshtein distance and fuzzy string matching, were compared in order to create a public dataset from the collected data. Experimental results show that the proposed algorithm successfully detects and corrects spelling errors, and that the resulting dataset can be used as an open source dataset in sentiment analysis studies.

Patent•

Character identification method, device, equipment and readable storage medium

Chen Guodong, Ding Zihao, Wang Zheng, Wang Zhenhua, Sun Lining

02 Oct 2018

TL;DR: In this article, a character identification method, device, equipment and a readable storage medium are presented. The method comprises the following steps: preprocessing a gathered to-be-identified image and obtaining the characters in it; matching each character against templates in a pre-built fuzzy matching template database, thus obtaining an initial matching result and a matching rate for each character; and analyzing the structure characteristics of characters with a plurality of matching results and/or a low matching rate, thus obtaining their final matching result.

Abstract: The invention discloses a character identification method, device, equipment and a readable storage medium; the method comprises the following steps: preprocessing a gathered to-be-identified image, and obtaining characters in the to-be-identified image; respectively matching the characters in the to-be-identified image with templates in a pre-built fuzzy matching template database, thus obtaining an initial matching result and a matching rate of each character in the to-be-identified image; analyzing structure characteristics of characters with a plurality of matching results and/or low matching rate in the to-be-identified image, thus obtaining the final matching result of the characters with the plurality of matching results and/or low matching rate in the to-be-identified image. The character identification method, device, equipment and the readable storage medium can improve the character identification accuracy and efficiency.
