Showing papers on "Approximate string matching" published in 2019
••
TL;DR: The main purpose of this survey is to propose new classification, identify new directions and highlight the possible challenges, current trends, and future works in the area of string matching algorithms with a core focus on exact string matching algorithms.
Abstract: String matching has been an extensively studied research domain in the past two decades due to its various applications in the fields of text, image, signal, and speech processing. As a result, choosing an appropriate string matching algorithm for current applications and addressing challenges is difficult. Understanding different string matching approaches (such as exact string matching and approximate string matching algorithms), integrating several algorithms, and modifying algorithms to address related issues are also difficult. This paper presents a survey on single-pattern exact string matching algorithms. The main purpose of this survey is to propose new classification, identify new directions and highlight the possible challenges, current trends, and future works in the area of string matching algorithms with a core focus on exact string matching algorithms.
69 citations
••
08 Apr 2019
TL;DR: MF-Join is proposed, a multi-level filtering approach for fuzzy string similarity join that provides a flexible framework that can support multiple similarity functions at both levels and clearly outperforms state-of-the-art methods.
Abstract: As an essential operation in data integration and data cleaning, similarity join has attracted considerable attention from the database community. In many application scenarios, it is essential to support fuzzy matching, which allows approximate matching between elements that improves the effectiveness of string similarity join. To describe the fuzzy matching between strings, we consider two levels of similarity, i.e., element-level and record-level similarity. Then the problem of calculating fuzzy matching similarity can be transformed into finding the weighted maximal matching in a bipartite graph. In this paper, we propose MF-Join, a multi-level filtering approach for fuzzy string similarity join. MF-Join provides a flexible framework that can support multiple similarity functions at both levels. To improve performance, we devise and implement several techniques to enhance the filter power. Specifically, we utilize a partition-based signature at the element-level and propose a frequency-aware partition strategy to improve the quality of signatures. We also devise a count filter at the record level to further prune dissimilar pairs. Moreover, we deduce an effective upper bound for the record-level similarity to reduce the computational overhead of verification. Experimental results on two popular datasets show that our proposed method clearly outperforms state-of-the-art methods.
34 citations
•
TL;DR: SneakySnake is introduced, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment and is efficient to implement on CPUs, GPUs, and FPGAs.
Abstract: Motivation: We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs, and FPGAs. Results: SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper, and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers's bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7x and 43.9x (>12x on average), respectively, with its CPU implementation, and by up to 413x and 689x (>400x on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979x (276.9x on average) and 91.7x (31.7x on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g., configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities.
28 citations
••
01 Dec 2019
TL;DR: In this article, a virtual-black-box (VBB) secure and input-hiding obfuscator for fuzzy matching for Hamming distance, based on certain natural number-theoretic computational assumptions, is presented.
Abstract: We consider the problem of obfuscating programs for fuzzy matching (in other words, testing whether the Hamming distance between an n-bit input and a fixed n-bit target vector is smaller than some predetermined threshold). This problem arises in biometric matching and other contexts. We present a virtual-black-box (VBB) secure and input-hiding obfuscator for fuzzy matching for Hamming distance, based on certain natural number-theoretic computational assumptions. In contrast to schemes based on coding theory, our obfuscator is based on computational hardness rather than information-theoretic hardness, and can be implemented for a much wider range of parameters. The Hamming distance obfuscator can also be applied to obfuscation of matching under the \(\ell _1\) norm on \(\mathbb {Z}^n\).
17 citations
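For orientation, the plain (unobfuscated) predicate that this entry's obfuscator hides can be sketched as follows. This is only the underlying fuzzy-matching test on n-bit inputs represented as Python integers; it provides no security, and the function names are hypothetical rather than taken from the paper.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which the two bit vectors differ."""
    return bin(x ^ y).count("1")


def fuzzy_match(x: int, target: int, threshold: int) -> bool:
    """True when x is at Hamming distance strictly below threshold from target."""
    return hamming_distance(x, target) < threshold
```

An obfuscated version would let a caller evaluate fuzzy_match without learning target; the sketch above exposes it directly.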
••
14 May 2019
TL;DR: A novel method for addressing 'Fuzzy Matching', which exploits the fact most server-class CPUs include vector operations to parallelize message matching and allows matches based on 'partial truth', i.e., by identifying probable rather than exact matches.
Abstract: Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI_THREAD_MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially unacceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knights Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.
11 citations
••
01 Sep 2019
TL;DR: An analysis of spoken document retrieval techniques that apply word similarity based on phonemic transcriptions or approximate string matching, with results obtained on a collection of spoken documents in Russian.
Abstract: The article contains an analysis of spoken document retrieval techniques which apply word similarity based on building phonemic transcriptions or on approximate string matching. Results are obtained on a collection of spoken documents in Russian. Grapheme-to-phoneme conversion methods based on a hidden Markov model and on first- and second-order finite Markov chains are discussed in the article.
9 citations
••
01 Dec 2019
TL;DR: This work extends existing filtering-based subgraph matching algorithms and proposes a new set of filters leveraging the monotone function properties in the multiplex setting that enables effective pruning of irrelevant subgraph regions and expedites the overall matching process.
Abstract: We study the problem of detecting matching subgraphs in a large multiplex background network based on predefined subgraph templates. Our approach extends existing filtering-based subgraph matching algorithms and proposes a new set of filters leveraging the monotone function properties in the multiplex setting. This enables effective pruning of irrelevant subgraph regions and expedites the overall matching process. In addition, our approach proposes a new strategy based on maximum likelihood estimate to identify “closely matched” subgraphs that are not isomorphic to the given templates from a noisy background network. This allows us to generalize this approach to real-world networks, which are often noisy, incomplete and ambiguous. We demonstrate the effectiveness of the proposed method on a real-world multiplex network provided by the DARPA Modeling Adversarial Activity (MAA) program. Our approach obtains highly accurate subgraph matching results for both the clean and noisy versions of the network, which significantly outperforms the baseline filtering methods. Furthermore, our proposed approach is parallelizable such that it can scale up to handle large input networks.
8 citations
••
TL;DR: This work presents a novel public-key construction for secure two-party evaluation of threshold functions in restricted domains based on embeddings found in the message spaces of additively homomorphic encryption schemes.
Abstract: Real-world applications of record linkage often require matching to be robust in spite of small variations in string fields. For example, two health care providers should be able to detect a patient in common, even if one record contains a typo or transcription error. In the privacy-preserving setting, however, the problem of approximate string matching has been cast as a trade-off between security and practicality, and the literature has mainly focused on Bloom filter encodings, an approach which can leak significant information about the underlying records. We present a novel public-key construction for secure two-party evaluation of threshold functions in restricted domains based on embeddings found in the message spaces of additively homomorphic encryption schemes. We use this to construct an efficient two-party protocol for privately computing the threshold Dice coefficient. Relative to the approach of Bloom filter encodings, our proposal offers formal security guarantees and greater matching accuracy. We implement the protocol and demonstrate the feasibility of this approach in linking medium-sized patient databases with tens of thousands of records.
8 citations
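As a point of reference for the entry above, the threshold Dice coefficient can be sketched in the clear, without any of the privacy machinery the paper contributes. The tokenization into lowercase character bigrams is an assumption made here for illustration (the abstract does not state it), and all function names are hypothetical.

```python
def bigrams(s: str) -> set:
    """Set of lowercase character bigrams of s (an assumed tokenization)."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over the bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # two strings with no bigrams are treated as identical
    return 2 * len(x & y) / (len(x) + len(y))


def dice_at_least(a: str, b: str, threshold: float) -> bool:
    """Threshold form: reveal only whether the coefficient reaches threshold."""
    return dice(a, b) >= threshold
```

In the two-party setting described in the abstract, only the boolean outcome of a test like dice_at_least would be learned; this plaintext sketch offers no such guarantee.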
••
06 Jan 2019
TL;DR: An algorithm for pattern matching with mismatches running in time O((n + m) poly(k)).
Abstract: A fundamental problem on strings in the realm of approximate string matching is pattern matching with mismatches: Given a text t, a pattern p, and a number k, determine whether some substring of t has Hamming distance at most k to p; such a substring is called a k-match. As real-world texts often come in compressed form, we study the case of searching for a small pattern p in a text t that is compressed by a straight-line program. This grammar compression is popular in the string community, since it is mathematically elegant and unifies many practically relevant compression schemes such as the Lempel-Ziv family, dictionary methods, and others. We denote by m the length of p and by n the compressed size of t. While exact pattern matching, that is, the case k = 0, is known to be solvable in near-linear time O(n + m) [Jez TALG'15], despite considerable interest in the string community, the fastest known algorithm for pattern matching with mismatches runs in time [MATH HERE] [Gawrychowski, Straszak ISAAC'13], which is far from linear even for very small k. In this paper, we obtain an algorithm for pattern matching with mismatches running in time O((n + m) poly(k)). This is near-linear in the input size for any constant (or slightly superconstant) k. We obtain analogous running time for counting and enumerating all k-matches. Our algorithm is based on a new structural insight for approximate pattern matching, essentially showing that either the number of k-matches is very small or both text and pattern must be almost periodic. While intuitive and simple for exact matches, such a characterization is surprising when allowing k mismatches.
8 citations
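To make the definition of a k-match concrete, here is a naive baseline that scans an uncompressed text and reports every position where the pattern fits with at most k mismatches. It is not the paper's algorithm: it ignores grammar compression entirely and runs in time proportional to the text length times the pattern length. The function name is hypothetical.

```python
def k_matches(text: str, pattern: str, k: int) -> list:
    """Start positions i where text[i:i+m] has Hamming distance <= k to pattern."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        if mismatches <= k:
            hits.append(i)
    return hits
```

With k = 0 this reduces to exact pattern matching; for example, searching "abcabdabc" for "abc" with k = 1 also accepts the window "abd".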
••
01 Aug 2019
TL;DR: The IIT Patna’s submission to WMT 2019 shared task on parallel corpus filtering is described and the scoring method obtains 2nd position in the team ranking for 1-million Nepali-English NMT and 5-million Sinhala-English NMT categories.
Abstract: In this paper, we describe the IIT Patna’s submission to WMT 2019 shared task on parallel corpus filtering. This shared task asks the participants to develop methods for scoring each parallel sentence from a given noisy parallel corpus. Quality of the scoring method is judged based on the quality of SMT and NMT systems trained on smaller set of high-quality parallel sentences sub-sampled from the original noisy corpus. This task has two language pairs. We submit for both the Nepali-English and Sinhala-English language pairs. We define fuzzy string matching score between English and the translated (into English) source based on Levenshtein distance. Based on the scores, we sub-sample two sets (having 1 million and 5 million English tokens) of parallel sentences from each parallel corpus, and train SMT systems for development purpose only. The organizers publish the official evaluation using both SMT and NMT on the final official test set. A total of 10 teams participated in the shared task and according to the official evaluation, our scoring method obtains 2nd position in the team ranking for 1-million Nepali-English NMT and 5-million Sinhala-English NMT categories.
8 citations
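A fuzzy score "based on Levenshtein distance", as mentioned in the entry above, could look like the sketch below. The abstract does not give the exact formula, so the normalization by the longer string's length is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_score(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Sentence pairs could then be ranked by fuzzy_score between the English side and the machine-translated source side, keeping the highest-scoring pairs until the token budget is reached.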
••
01 Jan 2019
TL;DR: This thesis addresses important algorithms and data structures used in sequence analysis for applications such as read mapping, and introduces a recently published FM index based on a new data structure: EPR dictionaries.
••
23 Apr 2019
TL;DR: This work considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed, leading to a 15% improvement in the ratio of false positives to true positive classifications.
Abstract: Fuzzy string matching allows for close, but not exactly, matching strings to be compared and extracted from bodies of text. As such, they are useful in systems which automatically extract and process documents. We summarise and compare various existing algorithms for achieving string similarity measures: Longest Common Subsequence (LCS), Dice coefficient, Cosine Similarity, Levenshtein distance and Damerau distance. Based on previously classified customer support enquiries (tickets), we considered the effectiveness of different algorithms and configurations to automatically identify keywords of interest (such as error phrases, product names and warning messages) in instances where such key phrases are misspelled, copied incorrectly or are otherwise differently formed. An optimal algorithm selection is made based on novel studies of the aforementioned similarity measures on text strings tokenised into characters. Such analysis also allowed for an optimum similarity threshold to be identified for various categories of enquiries, to reduce mismatched strings whilst allowing optimal coverage of the correctly matched key phrases. This led to a 15% improvement in the ratio of false positives to true positive classifications over the existing approach used by a customer support system.
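One of the measures listed in the entry above, Longest Common Subsequence on strings tokenised into characters, can be sketched as follows. The normalization into a 0 to 1 similarity and the function names are assumptions for illustration; the abstract does not state how the measure was scaled or which threshold was chosen.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (character level)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Assumed normalization: 2 * LCS / (len(a) + len(b)), in [0, 1]."""
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

A misspelled key phrase such as "timeuot" would then be accepted as "timeout" whenever lcs_similarity exceeds the threshold tuned for that category of enquiry.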
••
14 Jul 2019
TL;DR: In this article, the traffic time-points data is transformed to a string, which is used by a new fast approximate string matching algorithm to detect anomalies in DNS traffic, and the approach is generic in its nature and allows fast adaptation to different types of traffic.
Abstract: In this paper we propose a novel approach to identify anomalies in DNS traffic. The traffic time-points data is transformed to a string, which is used by a new fast approximate string matching algorithm to detect anomalies. Our approach is generic in its nature and allows fast adaptation to different types of traffic. We evaluate the approach on a large public dataset of DNS traffic based on 10 days, discovering more than an order of magnitude more DNS attacks in comparison to auto-regression as a baseline. Moreover, the additional comparison has been made including other common regressors such as Linear Regression, Lasso, Random Forest and KNN, all of them showing the superiority of our approach.
••
10 Dec 2019
TL;DR: In this paper, the Damerau-Levenshtein Distance approximate string matching algorithm is used to calculate the edit distance between each word in the keywords and each word in the Indonesian word dictionary.
Abstract: Searching is one of the important features of a website, but it is not uncommon for users to make typing errors (typos) when entering keywords. This study aims to build a system that suggests corrections for typos in the search feature. Keyword corrections are obtained using the Damerau-Levenshtein Distance approximate string matching algorithm by calculating the edit distance between each word in the keywords and each word in the Indonesian word dictionary. Testing comprised 40 experiments, with 10 keywords and 250 articles taken randomly. The test results show that the Damerau-Levenshtein Distance algorithm achieves precision and recall values of 91.24% and 89.58% in providing keyword correction suggestions. With the improvement of the system, each trial increases with precision value of 0.80 and recall value of 0.98.
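The dictionary lookup described in the entry above can be sketched as follows. The code uses the restricted Damerau-Levenshtein variant (optimal string alignment), which is an assumption since the abstract does not say which variant was used; the tiny word list, the distance cutoff and the function names are likewise hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def suggest(word: str, dictionary: list, max_distance: int = 2) -> list:
    """Dictionary entries within max_distance of word, closest first."""
    scored = [(osa_distance(word, entry), entry) for entry in dictionary]
    return [entry for dist, entry in sorted(scored) if dist <= max_distance]
```

For a multi-word query, suggest would be applied to each word separately, mirroring the per-word comparison against the Indonesian dictionary described in the abstract.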
••
24 Oct 2019
TL;DR: The research proved that accurate usable data can be obtained automatically from images with the implementation of a workflow after OCR.
Abstract: Companies are relying more on artificial intelligence and machine learning in order to enhance and automate existing business processes. While the power of OCR (Optical Character Recognition) technologies can be harnessed for the digitization of image data, the digitalized text still needs to be validated and enhanced to ensure that data quality standards are met for the data to be usable. This research paper focuses on finding and creating an automated workflow that can follow image digitization and produce a dictionary consisting of the desired information. The workflow introduced consists of a three-step process that is implemented after the OCR output has been generated. With the introduction of each step, the accuracy of key-value matches of field names and values is increased. The first step takes the raw OCR output and identifies field names using exact string matching and field-values using regular expressions from an externally maintained file. The second step introduces index pairing that matches field-values to field names based on the location of the field name and value on the document. Finally, approximate string matching is introduced to the workflow, which increases accuracy. By implementing these steps, the F-measure for key-value pair matches is measured at 60.18% in the first step, 80.61% once index pairing is introduced, and finally 90.06% after approximate string matching is introduced. The research proved that accurate usable data can be obtained automatically from images with the implementation of a workflow after OCR.
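The first and last steps of the workflow in the entry above (exact matching of field names, then an approximate fallback) might be sketched like this. Python's standard difflib is used as a stand-in matcher; the abstract does not name the approximate matching technique, and difflib scores similarity from matching blocks rather than edit distance. The field list, cutoff and function name are hypothetical.

```python
import difflib


def match_field_name(token: str, field_names: list, cutoff: float = 0.8):
    """Return the known field name for an OCR token, or None if nothing is close."""
    if token in field_names:
        return token  # step 1: exact string match
    # fallback: closest known field name whose similarity ratio reaches cutoff
    close = difflib.get_close_matches(token, field_names, n=1, cutoff=cutoff)
    return close[0] if close else None
```

An OCR token such as "lnvoice Numbcr" would resolve to "Invoice Number" through the fallback, which is the kind of recovery the reported gain in F-measure for the final step suggests.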
••
12 Dec 2019
TL;DR: An automatic assignment generation and assessment system to help students learn parallel programming that can automatically generate an overall assessment of student assignments by using fuzzy string matching, which provides an approximate reference score of objective questions.
Abstract: The course of parallel programming is becoming more and more important for the education of students majoring in computer science. However, it is not easy to learn parallel programming well due to its high theory and practice requirements. In this paper, we design and implement an automatic assignment generation and assessment system to help students learn parallel programming. The assignments can be generated according to user behaviors and thus able to guide students to learn parallel programming step by step. Besides, it can automatically generate an overall assessment of student assignments by using fuzzy string matching, which provides an approximate reference score of objective questions. Subjective questions can be assessed directly by comparing the answer to the reference answer. This system also provides a friendly user interface for students to complete online assignments and let teachers manage their question database. In our teaching practice, students can learn parallel programming more effectively with the help of such an assignment generation and assessment system.
•
16 May 2019
TL;DR: In this paper, a ranking score based on a combination of the fuzzy match score and the match frequency score for a corresponding piece of data from the flat data is calculated for each node in a graph structure.
Abstract: In an example, for each of one or more search terms, pieces of data from flat data are searched to locate one or more matching pieces of data from the flat data, wherein a piece of data from the flat data matches if it contains at least one attribute with a value that is similar to the search term. Then, for each matching piece of data from the flat data, a fuzzy match score and a match frequency score are calculated. For each node in a graph structure, a ranking score based on a combination of the fuzzy match score and the match frequency score for a corresponding piece of data from the flat data is calculated. One or more search results are returned based on the ranking scores of nodes corresponding to pieces of data for the one or more search results.
••
01 Nov 2019
TL;DR: A framework, namely, Electronic Patient Matching System (EPMS), which attempts to overcome barriers while achieving a good accuracy in matching patient records and encodes the patient records using variational autoencoder and amalgamates them by performing locality sensitive hashing on an Apache Spark cluster.
Abstract: The healthcare industry, through digitization, is trying to achieve interoperability, but has not been able to achieve complete Health Information Exchange (HIE). One of the major challenges in achieving this is the inability to accurately match patient data. Mismatching of patient records can lead to improper treatment which can prove to be fatal. Also, the presence of duplicate overheads has caused inaccessibility to crucial information in the time of need. Existing solutions to patient matching are both time-consuming and non-scalable. This paper proposes a framework, namely, Electronic Patient Matching System (EPMS), which attempts to overcome these barriers while achieving a good accuracy in matching patient records. The framework encodes the patient records using variational autoencoder and amalgamates them by performing locality sensitive hashing on an Apache Spark cluster. This makes the process faster and highly scalable. Furthermore, a fuzzy matching of the records in each block is performed using Levenshtein distances to identify the duplicate patient records. Experimental investigations were performed on a synthetically generated dataset consisting of 44555 patient records. The proposed framework achieved a matching accuracy of 81.15% on this dataset.
••
28 May 2019
TL;DR: This work proposes several changes to usual edit sequences, specifically augmenting edits with content data and using fuzzy matching, in an attempt to improve semantic preservation.
Abstract: Genetic improvement uses automated search to find improved versions of existing software. Edit sequences have been proposed as a very convenient way to represent code modifications, focusing on the changes themselves rather than duplicating the entire program. However, edits are usually defined in terms of practical operations rather than in terms of semantic changes; indeed, crossover and other edit sequence mutations usually never guarantee semantic preservation. We propose several changes to usual edit sequences, specifically augmenting edits with content data and using fuzzy matching, in an attempt to improve semantic preservation.
••
11 Jul 2019
TL;DR: A metric to determine the similarity of two routes of vessels based on discrete readings of their positions at different times, a clustering method for processing the data available in «Hart-12» and definitions for typical routes are proposed.
Abstract: The article presents a method of detection of the spatial anomalies of vessel traffic based on «Hart-12» Information-Telecommunication Maritime Security System. Within this methodology, the authors propose a metric to determine the similarity of two routes of vessels based on discrete readings of their positions at different times, a clustering method for processing the data available in «Hart-12» and definitions for typical routes. When assessing risks of violation of border legislation at the maritime boundary, it is proposed to take into account the correspondence of the spatial data of the route to its descriptive part as well as average and maximum deviations from a typical route. When verifying descriptive data, the fuzzy matching algorithm based on the Levenshtein distance is used.
•
28 May 2019
TL;DR: In this paper, a transaction sanction list matching system consisting of a message queue module, a client name analysis module and a fuzzy matching module is presented, where the fuzzy matching is performed on the client names to cut a list and output the list.
Abstract: The invention discloses a transaction sanction list matching system. The system comprises a message queue module, a client name analysis module and a fuzzy matching module; the message queue module is mainly responsible for uninterruptedly subscribing to receive the transaction information flow, analyzing transaction information and extracting client information; the client name analysis module is used for extracting and analyzing client names from the extracted client information by using an NLP model constructed by using an NLP technology trained by a machine learning algorithm, and finally, fuzzy matching is performed on the client names in the fuzzy matching module to cut a list and output the list. According to the invention, the mismatching rate of the traditional customer identity list cutting system can be reduced by more than 50%; according to the invention, the client identity recognition can be improved from character matching to the client information association matching level, and the identity authentication matching reliability is greatly improved.
01 Jan 2019
TL;DR: An approximate string matching method using weighted edit distance for searching keywords in OCR-ed business documents and the evaluation on a Czech invoice dataset shows that the method can detect a significant part of erroneous keywords.
Abstract: Optical Character Recognition (OCR) is achieving higher accuracy. However, to decrease error rate down to zero is still a human desire. This paper presents an approximate string matching method using weighted edit distance for searching keywords in OCR-ed business documents. The evaluation on a Czech invoice dataset shows that the method can detect a significant part of erroneous keywords.
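A weighted edit distance of the kind named in the entry above can be sketched by making substitutions between visually similar characters cheaper than other edits. The confusion pairs, their costs and the function names below are invented for illustration and are not taken from the paper.

```python
# Hypothetical costs for substitutions that OCR engines commonly confuse.
CONFUSION_COST = {("0", "O"): 0.2, ("1", "l"): 0.2, ("1", "I"): 0.2, ("5", "S"): 0.3}


def substitution_cost(a: str, b: str) -> float:
    """0 for equal characters, a reduced cost for listed pairs, otherwise 1."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), CONFUSION_COST.get((b, a), 1.0))


def weighted_edit_distance(a: str, b: str, indel_cost: float = 1.0) -> float:
    """Levenshtein-style dynamic program with per-pair substitution costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + indel_cost,        # delete ca
                            curr[j - 1] + indel_cost,    # insert cb
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

Under these assumed weights, "T0TAL" sits much closer to the keyword "TOTAL" than "TXTAL" does, so a single distance threshold can accept likely OCR errors while rejecting unrelated tokens.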
•
18 Jun 2019
TL;DR: A character string fuzzy matching and querying method based on edit distance is proposed, in which strings whose matching degree is below a preset lower bound are filtered out directly, strings above a preset upper bound with no repeated element in the position list are added to the result set, and the remaining strings are verified by edit distance.
Abstract: The invention discloses a character string fuzzy matching and querying method based on an editing distance. The method comprises the following steps: sequentially dividing the query strings according to the lengths of the character strings in the paragraphs so as to obtain a query string substring set; when the character strings in the paragraphs are matched with the character strings in the query string, adding the length of the character string into the matching degree of the original character string of the index corresponding to the character string, adding the character string into a result set when the matching degree of the character string is greater than a preset upper bound value and no repeated element exists in the position list, otherwise, verifying the editing distance of the character string; when the matching degree of the character string is smaller than a preset lower bound value, directly filtering the character string; when the matching degree of the character string is between the preset lower bound value and the preset upper bound value, editing distance verification is carried out on the character string, the difference that paragraphs of different lengths do not affect the matching result can be reflected through the method, and meanwhile the number of editing distance verification operations is small.
••
TL;DR: A new fuzzy matching method is introduced to create an adaptive matching zone (region) where the two corresponding features from different frames can be matched perfectly and the magnitude and direction of the motion is used for accurate elimination of camera motion.
Abstract: In computer vision, multiple object tracking plays a vital and challenging role. To solve the issues in this research field, various traditional techniques have been developed. In this paper, we consider the problem of tracking multiple persons in a dynamic environment (background) with effects such as illumination changes and moving shadows. Notably, i) estimating camera motion and ii) multiple persons tracking are the two main phases involved in our proposed approach. In the first phase, the good features were extracted using both the SIFT features extraction steps and Gaussian noise elimination method. Instead of using the conventional SIFT-based matching method, we have introduced a new fuzzy matching method to create an adaptive matching zone (region). Using this, the two corresponding features from different frames can be matched perfectly. The brightness of a matching feature of interest indicates its size. Additionally, we use the magnitude and direction of the motion for accurate elimination of camera motion. In the second phase, the persons are tracked from the moving object by finding the optimal feature points, and the final points are clustered as the moving persons (objects). Experimental validation was performed on different challenging datasets and promising results are achieved by our proposed method compared to other existing methods.
•
TL;DR: The approach is evaluated on a large public dataset of DNS traffic based on 10 days, discovering more than an order of magnitude more DNS attacks in comparison to auto-regression as a baseline.
Abstract: In this paper we propose a novel approach to identify anomalies in DNS traffic. The traffic time-points data is transformed to a string, which is used by a new fast approximate string matching algorithm to detect anomalies. Our approach is generic in its nature and allows fast adaptation to different types of traffic. We evaluate the approach on a large public dataset of DNS traffic based on 10 days, discovering more than an order of magnitude more DNS attacks in comparison to auto-regression as a baseline. Moreover, the additional comparison has been made including other common regressors such as Linear Regression, Lasso, Random Forest and KNN, all of them showing the superiority of our approach.
••
01 Sep 2019TL;DR: This paper proposes the optimal solution to the fuzzy segmentation problem and considers its application to the decomposition of a given function.
Abstract: The problem of segmentation of a given string to match a fuzzy pattern is considered in this paper. The fuzzy pattern is defined as a sequence of fuzzy properties. It is assumed that each string can match a fuzzy property in some measure. Being increasing, decreasing, or oscillating are examples of fuzzy properties of a numerical sequence. The problem we consider is how to split the given string (sequence) into sufficiently long substrings (contiguous subsequences) to match the pattern as well as possible. This problem can be considered in the frame of the fuzzy clustering problem that has many applications in such areas as image processing, bioinformatics, etc. It can also be viewed as a special case of the fuzzy string matching problem. In this paper, we propose the optimal solution to the fuzzy segmentation problem and consider its application to the decomposition of a given function.
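The segmentation problem can be illustrated with a small dynamic program. This is a sketch under assumptions the abstract does not confirm: each fuzzy property is a function returning a membership degree in [0, 1], the quality of a segmentation is the smallest degree among its segments, and "sufficiently long" means at least `min_len` elements. The property definitions and all names are hypothetical.

```python
def increasing(seg):
    """Degree to which seg is increasing: the fraction of rising steps."""
    steps = len(seg) - 1
    return sum(b > a for a, b in zip(seg, seg[1:])) / steps if steps else 0.0

def decreasing(seg):
    """Degree to which seg is decreasing: the fraction of falling steps."""
    steps = len(seg) - 1
    return sum(b < a for a, b in zip(seg, seg[1:])) / steps if steps else 0.0

def segment(seq, pattern, min_len=2):
    """Split seq into len(pattern) contiguous segments, each of at least
    min_len elements, maximizing the smallest membership degree.

    Returns (score, ends), where ends[j] is the exclusive end index of
    segment j, or None if seq is too short for the pattern.
    """
    n, k = len(seq), len(pattern)
    # best[j][i]: best score when the first j properties cover seq[:i]
    best = [[-1.0] * (n + 1) for _ in range(k + 1)]
    cut = [[None] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 1.0
    for j in range(1, k + 1):
        for i in range(j * min_len, n + 1):
            for s in range((j - 1) * min_len, i - min_len + 1):
                if best[j - 1][s] < 0:
                    continue  # prefix seq[:s] cannot be segmented
                score = min(best[j - 1][s], pattern[j - 1](seq[s:i]))
                if score > best[j][i]:
                    best[j][i], cut[j][i] = score, s
    if best[k][n] < 0:
        return None
    # Walk the stored cut points backwards to recover the segment ends.
    ends, i = [], n
    for j in range(k, 0, -1):
        ends.append(i)
        i = cut[j][i]
    return best[k][n], ends[::-1]
```

Calling `segment([1, 2, 3, 4, 3, 2, 1], [increasing, decreasing])` splits the sequence into a rising part followed by a falling part. Taking the minimum over segments is one common fuzzy aggregation; a sum or product would need only a change to the `score` line.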
••
01 Mar 2019TL;DR: An algorithm to identify users based on check-in and profile matching, using a fuzzy matching method for profiles and a Bayesian algorithm for check-ins, achieves high security and privacy for the system.
Abstract: Social networking sites such as Facebook, Twitter, Instagram, and Chat are powerful communication tools. Users use multiple social network applications for different purposes, such as business or communication, and each user may hold multiple accounts across different social network applications. Identifying the same user accurately is important in many services. In this paper, we propose an algorithm to identify users based on check-in and profile matching, using a fuzzy matching method for profiles and a Bayesian algorithm for check-ins. This method is called “Profile Matching and Check-in” (PMC). The proposed method was simulated in Python (version 2.7) on the Gowalla dataset, where it reached an accuracy of 84%. This method achieves high security and privacy for the system.
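The abstract does not say which fuzzy matching function PMC applies to profiles. As a minimal sketch of the profile half only, the code below scores two profiles by averaging a per-field string similarity from Python's standard `difflib` module. The field names and the averaging rule are assumptions, and the Bayesian check-in component is not modeled.

```python
from difflib import SequenceMatcher

def field_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def profile_similarity(p, q, fields=("username", "name", "city")):
    """Average the field similarities over the fields both profiles fill in.

    The field names are hypothetical; returns 0.0 when no field is shared.
    """
    shared = [f for f in fields if p.get(f) and q.get(f)]
    if not shared:
        return 0.0
    return sum(field_similarity(p[f], q[f]) for f in shared) / len(shared)
```

A caller would compare the score against a tuned threshold to decide whether two accounts belong to the same person; the abstract does not report such a threshold.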
•
25 Jun 2019
TL;DR: In this paper, a geographic coding method and system based on Jieba word segmentation and an address word bank is presented, which is simple, easy to understand and easy to program and implement.
Abstract: The invention discloses a geographic coding method and system based on Jieba word segmentation and an address word bank. The method comprises the steps: step 1, address data are collected, and an address database is established; step 2, carrying out word segmentation on an address character string input by a user; step 3, performing two rounds of address matching and address standardization; and step 4, mapping the standard address into a geographic coordinate. The system comprises an address database used for storing collected eight-level standard address data and geographic coordinates of the eight-level standard address data; a word segmentation module used for splitting the address character string input by the user; an accurate matching module used for carrying out step-by-step accurate matching on the split address array and complementing the father-level address; a fuzzy matching module used for carrying out fuzzy matching on the inaccurately matched address character strings and completing standardization of addresses; and a mapping module used for mapping the standardized address into geographic coordinates and returning the geographic coordinates to the user. The algorithm is simple, easy to understand and easy to program and implement.
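The two rounds of matching described above can be sketched as follows. This is an illustration rather than the patented method: it assumes the address has already been split into tokens (the Jieba segmentation step is not shown), it uses `difflib.get_close_matches` from the Python standard library as a stand-in for the fuzzy matching module, and the word bank, cutoff, and coordinate table are hypothetical.

```python
from difflib import get_close_matches

def standardize(tokens, word_bank, cutoff=0.6):
    """Standardize address tokens in two rounds.

    Round 1 keeps tokens that appear in the word bank exactly.
    Round 2 replaces the remaining tokens with their closest word-bank
    entry, or None when nothing is similar enough.
    """
    result = []
    for token in tokens:
        if token in word_bank:
            result.append(token)
        else:
            close = get_close_matches(token, word_bank, n=1, cutoff=cutoff)
            result.append(close[0] if close else None)
    return result

def geocode(tokens, word_bank, coordinates):
    """Map a standardized address to coordinates, or None if any token
    could not be standardized or the address is unknown."""
    standard = standardize(tokens, word_bank)
    if None in standard:
        return None
    return coordinates.get(tuple(standard))
```

With a word bank of `["Guangdong", "Shenzhen", "Nanshan"]`, the misspelled token `"Shenzen"` is corrected to `"Shenzhen"` in the second round. The step that complements missing parent-level addresses is omitted here.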
•
04 Jan 2019
TL;DR: In this article, a fuzzy matching method for the photogrammetry of a port crane is presented, which includes the steps of calculating coordinates of an image plane coordinate system of each to-be-measured point as a key point according to an estimated parametric model and left and right exterior orientation elements.
Abstract: The invention discloses a fuzzy matching method for photogrammetry of a port crane. The fuzzy matching method for the photogrammetry of the port crane includes the steps of: calculating coordinates of an image plane coordinate system of each to-be-measured point as a key point according to an estimated parametric model and left and right exterior orientation elements; respectively getting a simulated left photo to-be-measured point distribution map and a simulated right photo to-be-measured point distribution map according to the calculated plane coordinates of each to-be-measured point in the image plane coordinate system; conducting a point screening process based on geometric relationship matching, which includes first adopting a feature recognition operator to obtain automatic recognition point maps of left and right photos, first filtering range according to preset matching point fuzzy recognition radius, and then using geometric constraint to select points to obtain a projection point of an object point in the image plane coordinate system; determining whether a match exists; changing the matching point fuzzy recognition radius to expand the selection range when no result output occurs, and returning to iteration until matching results are acquired. The fuzzy matching method for the photogrammetry of the port crane has the advantage of solving special and complex environmental problems of port crane photography.
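The iterate-and-expand step at the end of this procedure can be sketched as follows. This is a simplified illustration, not the patented method: it works on 2D image-plane points, it picks the nearest candidate in place of the geometric-constraint selection (which the abstract does not detail), and the growth factor and radius cap are assumptions.

```python
import math

def match_point(predicted, detected, radius, growth=1.5, max_radius=100.0):
    """Find a detected point near the predicted position, widening the
    fuzzy recognition radius until a candidate appears.

    Returns (point, radius_used), or None if nothing lies within max_radius.
    """
    while radius <= max_radius:
        candidates = [p for p in detected if math.dist(predicted, p) <= radius]
        if candidates:
            # Stand-in for the geometric constraint: take the nearest candidate.
            nearest = min(candidates, key=lambda p: math.dist(predicted, p))
            return nearest, radius
        radius *= growth  # no match, so expand the selection range and retry
    return None
```

Starting from a radius of 2.0, a detected point at distance 5 from the prediction is matched on the fourth pass, once the radius has grown to 6.75. The cap on the radius keeps the loop from running forever when no point is detected at all.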
•
08 Jan 2019
TL;DR: In this paper, the authors propose a data encryption method and device that support a keyword sorting search technique with fuzzy matching and semantic approximation matching, together with a data encryption retrieval system, applicable to the field of data encryption.
Abstract: The invention is applicable to the field of data encryption. Provided are a data encryption method and a device thereof supporting a keyword sorting search technique of fuzzy matching and semantic approximation matching, and a data encryption retrieval system, which have complete functions and high efficiency. The initializing step of the method comprises the following steps of: extracting a keyword set from data and establishing an original keyword dictionary, and then establishing a corresponding keyword stem dictionary and a keyword synonym dictionary, and establishing a word vector for fuzzy matching of the keywords in the original keyword dictionary; the secret key generating step of the method comprises the following steps of: generating a corresponding secret key according to a plurality of dictionaries; the index building step comprises: according to dictionary and mapping relation, index vector is built for each document, and clustering is carried out; the data encryption step comprises: encrypting the index vector.