Showing papers on "Approximate string matching" published in 2018


Journal ArticleDOI
TL;DR: This article presents a novel matching approach, leveraging a deep neural network to classify pairs of toponyms as either matching or nonmatching, and shows that the proposed method can significantly outperform individual similarity metrics from previous studies, as well as previous methods based on supervised machine learning for combining multiple metrics.
Abstract: Toponym matching, i.e. pairing strings that represent the same real-world location, is a fundamental problem for several practical applications. The current state-of-the-art relies on string similar...

61 citations


Journal ArticleDOI
TL;DR: A fuzzy quantum image matching scheme based on gray-scale difference is proposed to find the target region in a reference image that is very similar to the template image; the scheme enables exponentially significant speedup via quantum parallel computation.
Abstract: Quantum image processing has recently emerged as an essential problem in practical tasks, e.g. real-time image matching. Previous studies have shown that quantum superposition and entanglement can greatly improve the efficiency of complex image processing. In this paper, a fuzzy quantum image matching scheme based on gray-scale difference is proposed to find the target region in a reference image that is very similar to the template image. Firstly, we employ the novel enhanced quantum representation (NEQR) to store digital images. Then certain quantum operations are used to evaluate the gray-scale difference between two quantum images by thresholding. If none of the obtained gray-scale differences is greater than the threshold value, the fuzzy matching of the quantum images is successful. Theoretical analysis and experiments show that the proposed scheme performs fuzzy matching at a low cost and also enables exponentially significant speedup via quantum parallel computation.

17 citations


Book ChapterDOI
27 Jun 2018
TL;DR: An intelligent tutoring system for learning English and French concepts is presented that incorporates a novel model for error diagnosis using machine learning and employs two algorithmic techniques.
Abstract: Intelligent computer-assisted language learning employs artificial intelligence techniques to create a more personalized and adaptive environment for language learning. Towards this direction, this paper presents an intelligent tutoring system for learning English and French concepts. The system incorporates a novel model for error diagnosis using machine learning. This model employs two algorithmic techniques, specifically Approximate String Matching and String Meaning Similarity, in order to diagnose spelling mistakes, mistakes in the use of tenses, mistakes in the use of auxiliary verbs and mistakes originating from confusion in the simultaneous tutoring of languages. The model for error diagnosis is used by the fuzzy logic model which takes as input the results of the first or the knowledge dependencies existing among the different domain concepts of the learning material and decides dynamically about the learning content that is suitable to be delivered to the learner each time.

15 citations


Journal ArticleDOI
TL;DR: New algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance are proposed, and several methods for applying the CVM and PCVM algorithms when the input pattern is circular are shown.
Abstract: This paper proposes new algorithms for fixed-length approximate string matching and approximate circular string matching under the Hamming distance. Fixed-length approximate string matching and approximate circular string matching are special cases of approximate string matching and have numerous direct applications in bioinformatics and text searching. Firstly, a counter-vector-mismatches (CVM) algorithm is proposed to solve fixed-length approximate string matching with k mismatches. The CVM algorithm is based on the parallel summation of counters located in the same machine word. Secondly, a parallel counter-vector-mismatches (PCVM) algorithm is proposed to accelerate the CVM algorithm in parallel. The PCVM algorithm uses two levels of parallelism, exploiting not only word-level parallelism but also data parallelism via parallel environments such as multi-core processors and graphics processing units (GPUs). In the particular case of GPUs, a shared-mem parallel counter-vector-mismatches (PCVMsmem) scheme can be derived from the PCVM algorithm. The PCVMsmem scheme exploits the memory model of GPUs to optimize the performance of the PCVM algorithm. Finally, this paper shows several methods for applying the CVM and PCVM algorithms when the input pattern is circular. In experiments with real DNA packages, the proposed algorithms and scheme run much faster than the previous bit-vector-mismatches and parallel bit-vector-mismatches algorithms.

11 citations
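
To make the problem in the entry above concrete, here is a minimal sketch of fixed-length matching with k mismatches under the Hamming distance, plus a circular variant. It is a plain character-by-character baseline, not the CVM or PCVM algorithm (it uses no word-level or GPU parallelism), and the function names are illustrative assumptions.

```python
def k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Return every start index where pattern matches text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already exceeds the budget
        else:
            hits.append(i)  # loop finished without exceeding k
    return hits


def circular_k_mismatch_positions(text: str, pattern: str, k: int) -> list[int]:
    """Same search, but any rotation of the pattern counts as the pattern."""
    rotations = {pattern[r:] + pattern[:r] for r in range(len(pattern))}
    hits = set()
    for rotation in rotations:
        hits.update(k_mismatch_positions(text, rotation, k))
    return sorted(hits)
```

For example, `k_mismatch_positions("ACGTACGA", "ACGT", 1)` reports positions 0 and 4, because the window `ACGA` differs from the pattern in one character.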


Proceedings ArticleDOI
01 Feb 2018
TL;DR: This paper applies a mechanical approach to generate feature vectors from URL strings and can build a flexible filtering decision module by continually teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain.
Abstract: Protecting users from accessing malicious web sites is one of the important management tasks for network operators. There are many open-source and commercial products to control which web sites users can access. The most traditional approach is blacklist-based filtering. This mechanism is simple but not scalable, though there are some enhanced approaches utilizing fuzzy matching technologies. Other approaches try to use machine learning (ML) techniques by extracting features from URL strings. This approach can cover a wider area of Internet web sites, but finding good features requires deep knowledge of trends in web site design. Recently, another approach using deep learning (DL) has appeared. The DL approach helps to extract features automatically by investigating a large amount of existing sample data. Using this technique, we can build a flexible filtering decision module by continually teaching the neural network module about recent trends, without any specific expert knowledge of the URL domain. In this paper, we apply a mechanical approach to generate feature vectors from URL strings. We implemented our approach and tested it with realistic URL access history data taken from a research organization and data from the well-known archive of phishing site information, PhishTank.com. Our approach achieved 2 to 3% better accuracy compared to the existing DL-based approach.

8 citations


Journal ArticleDOI
01 Jan 2018
TL;DR: The syntactic service matching methods based on the classical set theory have a role in discovering desired services for composition and orchestration in the vast amount of Web services.
Abstract: The vast amount of Web services brings the problem of discovering desired services for composition and orchestration. The syntactic service matching methods based on the classical set theory have a

5 citations


Proceedings ArticleDOI
01 Mar 2018
TL;DR: ReneGENE-DP, implementations of the DP computations on hardware accelerators, with the novelty of realizing traceback in hardware in parallel with the forward scan during analysis, on both FPGA and GPU.
Abstract: Parsing a very long genomic string (the human genome is typically 3 billion characters long) abstracts the whole complexity of biocomputing. Approximate String Matching (ASM) is the most eligible computing paradigm that captures the biological complexity of the genome, integrating various sources of biological information into tractable probabilistic models. Though computationally complex, the Dynamic Programming (DP) methodology proves to be very efficient for ASM, in discriminating substantial similarities amongst severe noise in genetic data presented by evolution. Though a significant amount of the computation in DP algorithms is accelerated on multiple platforms, the less complex traceback step is still performed on the host, presenting a significant memory and Input/Output bottleneck. With billions of such alignments required to analyse the genomic big data from Next Generation Sequencing (NGS) platforms, this bottleneck can severely affect system performance. This paper presents ReneGENE-DP, our implementations of the DP computations on hardware accelerators, with the novelty of realizing traceback in hardware in parallel with the forward scan during analysis, on both FPGA and GPU. The fastest FPGA implementation is around 43.63x better than the fastest GPU implementation of ReneGENE-DP, which in turn is 380.85x faster than the reference design, a GPU-based DP algorithm with traceback on the host.

5 citations
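
The two DP phases mentioned in the entry above, the forward scan and the traceback, can be illustrated with a small software sketch. This is a textbook unit-cost edit-distance alignment, not ReneGENE-DP: the scoring (cost 1 per substitution, insertion or deletion) and the function name `align` are assumptions, and nothing here runs on an FPGA or GPU.

```python
def align(a: str, b: str) -> tuple[int, str, str]:
    """Return (edit distance, aligned a, aligned b), with '-' marking gaps."""
    n, m = len(a), len(b)

    # Forward scan: dp[i][j] is the edit distance between a[:i] and b[:j].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitution
                           dp[i - 1][j] + 1,         # deletion from a
                           dp[i][j - 1] + 1)         # insertion into a

    # Traceback: walk from the bottom-right cell back to the origin.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(top)), "".join(reversed(bottom))
```

The traceback needs the whole `dp` table, which is why moving it off the host matters when billions of alignments are computed.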


Proceedings ArticleDOI
01 Aug 2018
TL;DR: This paper proposes a wildcard-based multi-keyword fuzzy search (WMFS) scheme over the encrypted data, which tolerates keyword misspellings by exploiting the indecomposable property of primes.
Abstract: One critical challenge of today's cloud services is how to provide an effective search service while preserving user privacy. In this paper, we propose a wildcard-based multi-keyword fuzzy search (WMFS) scheme over the encrypted data, which tolerates keyword misspellings by exploiting the indecomposable property of primes. Compared with existing secure fuzzy search schemes, our WMFS scheme has the following merits: 1) Efficiency. It eliminates the requirement of a predefined dictionary and thus supports updates efficiently. 2) High accuracy. It eliminates the false positives and false negatives introduced by specific data structures and thus allows the user to retrieve files as accurately as possible. 3) Flexibility. It gives the user great flexibility to specify different search patterns including keyword and substring matching. Extensive experiments on a real data set demonstrate the effectiveness and efficiency of our scheme.

5 citations


Patent
25 Oct 2018
TL;DR: In this article, a system can identify a corresponding string stored in memory based on an incomplete input string by analyzing phonetic and distance metrics for a plurality of strings stored in the memory.
Abstract: Systems, apparatuses, and methods are provided for identifying a corresponding string stored in memory based on an incomplete input string. A system can analyze and produce phonetic and distance metrics for a plurality of strings stored in memory by comparing the plurality of strings to an incomplete input string. These similarity metrics can be used as the input to a machine learning model, which can quickly and accurately provide a classification. This classification can be used to identify a string stored in memory that corresponds to the incomplete input string.

5 citations


Posted Content
TL;DR: The Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata, which leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches.
Abstract: Integrating data from different sources has become a fundamental component of modern data analytics. Record linkage methods represent an important class of tools for accomplishing such integration. In the absence of common disambiguated identifiers, researchers often must resort to "fuzzy" matching, which allows imprecision in the characteristics used to identify common entities across different datasets. While the record linkage literature has identified numerous individually useful fuzzy matching techniques, there is little consensus on a way to integrate those techniques within a single framework. To this end, we introduce the Multiple Algorithm Matching for Better Analytics (MAMBA), an easy-to-use, flexible, scalable, and transparent software platform for business record linkage applications using Census microdata. MAMBA leverages multiple string comparators to assess the similarity of records using a machine learning algorithm to disambiguate matches. This software represents a transparent tool for researchers seeking to link external business data to the Census Business Register files.

4 citations


Journal ArticleDOI
TL;DR: The experimental results show that the proposed string matching algorithms perform very well compared to the brute-force, KMP and Boyer-Moore string matching algorithms.
Abstract: String matching is the technique of searching for a pattern in a text. It is the basic concept for extracting useful information from large volumes of text, and is used in applications such as text processing, information retrieval, text mining, pattern recognition, DNA sequencing and data cleaning. Although some simple mechanisms are reported to perform very well in practice, plenty of research has been published on the subject, research is still active in this area, and there are ample opportunities to develop new techniques. For this purpose, this paper proposes linear-array-based string matching, string matching with a butterfly model and string matching with divide-and-conquer models for sequential and parallel environments. To assess the efficiency of the proposed models, genome sequences of different sizes (10–100 Mb) are taken as the input data set. The experimental results show that the proposed string matching algorithms perform very well compared to the brute-force, KMP and Boyer-Moore string matching algorithms.

Posted Content
TL;DR: This paper presents the first solution in literature for the approximate string matching problem allowing for unbalanced translocations of factors, and shows that under the assumptions of equiprobability and independence of characters the algorithm has a $O(n\log^2_{\sigma} m)$ average time complexity.
Abstract: Unbalanced translocations are among the most frequent chromosomal alterations, accounting for 30% of all losses of heterozygosity, a major genetic event causing inactivation of tumor suppressor genes. Despite their central role in genomic sequence analysis, little attention has been devoted to the problem of matching sequences allowing for this kind of chromosomal alteration. In this paper we investigate the approximate string matching problem when the edit operations are non-overlapping unbalanced translocations of adjacent factors. In particular, we first present an $O(nm^3)$-time and $O(m^2)$-space algorithm based on the dynamic-programming approach. Then we improve our first result by designing a second solution which makes use of the Directed Acyclic Word Graph of the pattern. In particular, we show that under the assumptions of equiprobability and independence of characters, our algorithm has an $O(n\log^2_{\sigma} m)$ average time complexity, for an alphabet of size $\sigma$, still maintaining the $O(nm^3)$-time and the $O(m^2)$-space complexity in the worst case. To the best of our knowledge this is the first solution in the literature for the approximate string matching problem allowing for unbalanced translocations of factors.

Journal ArticleDOI
TL;DR: The study highlights the importance of data accuracy in network analysis and improves approximate string matching techniques to produce reliable network data-sets with more than 98% precision and recall.
Abstract: This paper concerns an Information Extraction process for building a dynamic Legislation Network from legal documents. Unlike supervised learning approaches which require additional calculations, the idea here is to apply Information Extraction methodologies by identifying distinct expressions in legal text and extract quality network information. The study highlights the importance of data accuracy in network analysis and improves approximate string matching techniques for producing reliable network data-sets with more than 98 percent precision and recall. The values, applications, and the complexity of the created dynamic Legislation Network are also discussed and challenged.

Posted ContentDOI
13 Apr 2018-bioRxiv
TL;DR: In this paper, a mixed integer program (MIP) is proposed to solve the optimization problem for Hamming distance with given number of pieces, which significantly improves the performance of search in bidirectional FM-index.
Abstract: Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard, allowing the search to start from anywhere within the pattern and extend in both directions. In particular, use of search schemes (partitioning the pattern and searching the pieces in certain orders with given bounds on errors) can yield significant speed-ups. However, finding optimal search schemes is a difficult combinatorial optimization problem. Here, for the first time, we propose a mixed integer program (MIP) capable of solving this optimization problem for Hamming distance with a given number of pieces. Our experiments show that the optimal search schemes found by our MIP significantly improve the performance of search in a bidirectional FM-index over previous ad-hoc solutions. For example, approximate matching of 101-bp Illumina reads (with two errors) becomes 35 times faster than standard backtracking. Moreover, despite being performed purely in the index, the running time of search using our optimal schemes (for up to two errors) is comparable to the best state-of-the-art aligners, which benefit from combining search in the index with in-text verification using dynamic programming. As a result, we anticipate that a full-fledged aligner that employs an intelligent combination of search in the bidirectional FM-index using our optimal search schemes and in-text verification using dynamic programming will outperform today's best aligners. The development of such an aligner, called FAMOUS (Fast Approximate string Matching using OptimUm search Schemes), is ongoing as our future work.
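
The idea of partitioning the pattern into pieces can be illustrated with the classic pigeonhole filter, a much simpler relative of search schemes: with at most k mismatches, at least one of k + 1 pieces must occur exactly. The sketch below is not the MIP-optimized schemes of the abstract and uses plain `str.find` on the text rather than a bidirectional FM-index; the function name is an assumption.

```python
def pigeonhole_search(text: str, pattern: str, k: int) -> list[int]:
    """Find start positions where pattern occurs with at most k mismatches."""
    m = len(pattern)
    if m <= k:
        raise ValueError("pattern must be longer than k so every piece is non-empty")

    pieces = k + 1
    bounds = [m * p // pieces for p in range(pieces + 1)]
    hits = set()
    for p in range(pieces):
        lo, hi = bounds[p], bounds[p + 1]
        piece = pattern[lo:hi]
        start = text.find(piece)
        while start != -1:
            candidate = start - lo  # where the whole pattern would begin
            if 0 <= candidate <= len(text) - m:
                # In-text verification of the full pattern at this position.
                window = text[candidate:candidate + m]
                if sum(a != b for a, b in zip(window, pattern)) <= k:
                    hits.add(candidate)
            start = text.find(piece, start + 1)
    return sorted(hits)
```

Search schemes refine this by allowing errors inside the pieces with per-piece bounds, which is what makes choosing a good scheme a combinatorial optimization problem.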

Patent
05 Jan 2018
TL;DR: In this paper, a method and device for character string matching with fuzzy nodes is described, where the AC state machine generates regular nodes based on the non-wildcard relationship between the characters contained in each rule character string.
Abstract: The invention discloses a method and device for character string matching. The method includes the steps that an AC state machine with fuzzy nodes is initialized, wherein the AC state machine generates regular nodes based on the non-wildcard relationship between the characters contained in each rule character string, and generates the corresponding fuzzy nodes according to the wildcard relationship between characters; target character strings are entered into the AC state machine, each character in the target character strings is compared with the corresponding character of each node in the AC state machine, one or more rule character strings matching the target strings are determined, and a corresponding operation is performed according to the matched rule character strings. According to the technical scheme, after the target character strings are obtained, they are entered into the AC state machine for matching, the one or more rule character strings matched in the target character string are determined, multi-segment fuzzy matching is achieved, the flexibility of definition by rule character strings is guaranteed, and the application demand is satisfied.

Journal ArticleDOI
TL;DR: A deep learning neural network for character-level text classification of noisy text spots keywords in the text output of an optical character recognition system using memoization and by encoding the text into feature vectors related to letter frequency.
Abstract: A deep learning neural network for character-level text classification is described in this work. The system spots keywords in the text output of an optical character recognition system using memoi...

Journal ArticleDOI
TL;DR: This paper reviews single-pattern parameterized pattern matching algorithms that use q-grams and have linear time complexity, and discusses which algorithm produces the fewest false matches.
Abstract: Identification of candidate genes and nucleotides are basic uses of the Bioinformatics research. Biology that deals with molecule has functional as well as structural behavior imparting need of wel...

Proceedings ArticleDOI
25 Aug 2018
TL;DR: A multi-space keyword fuzzy query algorithm is proposed that converts distance calculation to Morton code matching to improve query efficiency and uses a fuzzy matching algorithm to support query fault tolerance.
Abstract: With the large-scale use of smart devices with positioning capabilities, more and more spatial data is generated and each piece of data contains more and more information. However, previous spatial keyword query algorithms support only single-keyword queries, which cannot meet users' more personalized needs. Therefore, this paper proposes a multi-space keyword fuzzy query algorithm. In this algorithm, the two-dimensional spatial distance calculation used previously is converted to Morton code matching to improve query efficiency, and a fuzzy matching algorithm is added to support query fault tolerance. Experimental results show that the proposed algorithm improves query efficiency and users' satisfaction.
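
A Morton code (Z-order code) interleaves the bits of two coordinates into one integer, so that nearby grid cells tend to share a code prefix and spatial proximity can be tested by comparing codes. The sketch below shows only the standard encoding and decoding; how the paper combines it with keyword matching is not described in the abstract, and the 16-bit default for `bits` is an assumption.

```python
def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code: int, bits: int = 16) -> tuple[int, int]:
    """Recover (x, y) from a Morton code produced by morton_encode."""
    x = y = 0
    for i in range(bits):
        x |= ((code >> (2 * i)) & 1) << i
        y |= ((code >> (2 * i + 1)) & 1) << i
    return x, y
```

For instance, `morton_encode(3, 5)` yields 39 (binary `100111`), and `morton_decode(39)` returns `(3, 5)`.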

Patent
01 May 2018
TL;DR: In this paper, a fuzzy matching method for equipment model names is proposed. Addressing the particularities of equipment names, a similarity distance between the character strings of two equipment names is calculated through an improved Jaro-Winkler algorithm, and a threshold value is set to judge whether the two character strings express the same equipment model.
Abstract: The invention discloses a fuzzy matching method for an equipment model name. The method addresses the particularities of equipment names: a similarity distance between the character strings of two equipment names is calculated through an improved Jaro-Winkler algorithm, and a threshold value is set to judge whether the character strings of the two equipment names express the same equipment model. With this method, differences between names expressed in Chinese characters and in Pinyin initials are masked, the influence of the numeric mark number on model name matching is increased, and whether the character strings of the equipment names match can be accurately judged.
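
For reference, the standard Jaro-Winkler similarity with a threshold decision can be sketched as follows. This is the textbook measure, not the patent's improved variant (the handling of Chinese characters, Pinyin initials and numeric mark numbers is not specified in the abstract); the threshold of 0.9 and the function names are assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    matched1, matched2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(n2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(n1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / n1 + matches / n2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted by the length of the common prefix (at most 4)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def same_model(name1: str, name2: str, threshold: float = 0.9) -> bool:
    """Hypothetical decision rule: names match if similarity reaches the threshold."""
    return jaro_winkler(name1, name2) >= threshold
```

With these definitions, `jaro_winkler("MARTHA", "MARHTA")` is approximately 0.961, so the pair passes a 0.9 threshold.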


Book ChapterDOI
02 Sep 2018
TL;DR: This work presents an approach that transforms the dictionary and each input token into a compact well-known phonetic representation and applies a second similarity measure to filter the best result to annotate a given entity.
Abstract: Well-defined dictionaries of tagged entities are used in many tasks to identify entities where the scope is limited and there is no need to use machine learning. One common solution is to encode the input dictionary into Trie trees to find matches on an input text. However, the size of the dictionary and the presence of spelling errors on the input tokens have a negative influence on such solutions. We present an approach that transforms the dictionary and each input token into a compact well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to filter a set of initial results. Lastly, we apply a second similarity measure to filter the best result to annotate a given entity. The experiments showed that it achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.

Book ChapterDOI
Kensuke Baba
01 Jan 2018
TL;DR: The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables its implementation to be configured in accordance with a given data space and a required accuracy.
Abstract: Plagiarism detection for a huge amount of document data requires efficient methods. This paper proposes a plagiarism detection algorithm based on approximate string matching and vector representation of words, and a speed improvement to an implementation of the algorithm. The effect of the improvement on the algorithm is evaluated by conducting experiments with a dataset. The experimental results show a tradeoff between the processing time and the accuracy of the plagiarism detection algorithm, which enables us to configure its implementation in accordance with a given data space and a required accuracy.

Patent
08 Nov 2018
TL;DR: In this paper, a comparison length based on lengths of strings in a set of strings is used to normalize strings to identify computing resources and improve performance and utilization of computing resources.
Abstract: Systems and methods are disclosed for normalizing strings to identify computing resources and improve performance and utilization of computing resources. For example, methods may include determining a comparison length based on lengths of strings in a set of strings; padding a first string from the set of strings to the comparison length to obtain a padded string; receiving a second string; determining a distance between the second string and the padded string; and identifying a match between the first string and the second string based on the distance.

Patent
Xu Bo, Jiang Jing, Li Kunjian, Pan Feng, Zhu Jian 
25 May 2018
TL;DR: In this paper, a drawing similarity comparison system is presented, which comprises a building module for obtaining feature library pictures and pictures to be retrieved and obtaining a low frequency component through calculation of Fourier transform.
Abstract: The invention discloses a drawing similarity comparison system. The system comprises a building module for obtaining feature library pictures and pictures to be retrieved and obtaining a low frequency component through calculation of Fourier transform, a unification processing module for enabling the obtained low frequency component to generate corresponding feature indexing character string data, and a data processor for importing the feature indexing character string data of the feature library pictures into a database and encoding the feature library pictures and providing matched samples for retrieval. Feature character strings are obtained through Fourier transform, only front-to-back comparison is needed for character string similarity comparison, and the influence of the rear character on the similarity is only one half that of the previous character; keywords in the database can be directly adopted for fuzzy search, the retrieval efficiency is high, and the calculated quantity is small.

Book ChapterDOI
03 Jan 2018
TL;DR: This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches.
Abstract: Fuzzy search is often used in digital forensic investigations to find words that are stringologically similar to a chosen keyword. However, a common complaint is the high rate of false positives in big data environments. This chapter describes the design and implementation of cedas, a novel constrained edit distance approximate string matching algorithm that provides complete control over the types and numbers of elementary edit operations considered in approximate matches. The unique flexibility of cedas facilitates fine-tuned control of precision-recall trade-offs. Specifically, searches can be constrained to the union of matches resulting from any exact edit combination of insertion, deletion and substitution operations performed on the search term. The flexibility is leveraged in experiments involving fuzzy searches of an inverted index of the Enron corpus, a large English email dataset, which reveal the specific edit operation constraints that should be applied to achieve valuable precision-recall trade-offs. The constraints that produce relatively high combinations of precision and recall are identified, along with the combinations of edit operations that cause precision to drop sharply and the combination of edit operation constraints that maximize recall without sacrificing precision substantially. These edit operation constraints are potentially valuable during the middle stages of a digital forensic investigation because precision has greater value in the early stages of an investigation while recall becomes more valuable in the later stages.
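
The kind of constraint described above, limiting each type of edit operation separately, can be illustrated with a small check that decides whether a candidate word is reachable from the keyword within given per-operation budgets. This is not the cedas algorithm itself (its design is not given in the abstract, and it constrains searches to exact edit combinations); the "at most" budgets and the function name are assumptions.

```python
from functools import lru_cache


def matches_with_budget(keyword: str, candidate: str,
                        max_ins: int, max_del: int, max_sub: int) -> bool:
    """True if candidate can be produced from keyword using at most
    max_ins insertions, max_del deletions and max_sub substitutions."""

    @lru_cache(maxsize=None)
    def reachable(i: int, j: int, ins: int, dele: int, sub: int) -> bool:
        if i == len(keyword) and j == len(candidate):
            return True
        if i < len(keyword) and j < len(candidate):
            if keyword[i] == candidate[j]:
                if reachable(i + 1, j + 1, ins, dele, sub):
                    return True
            elif sub > 0 and reachable(i + 1, j + 1, ins, dele, sub - 1):
                return True
        # Insertion: candidate has an extra character at position j.
        if j < len(candidate) and ins > 0 and reachable(i, j + 1, ins - 1, dele, sub):
            return True
        # Deletion: keyword character at position i is dropped.
        if i < len(keyword) and dele > 0 and reachable(i + 1, j, ins, dele - 1, sub):
            return True
        return False

    return reachable(0, 0, max_ins, max_del, max_sub)
```

Allowing one substitution accepts "enrom" for the keyword "enron", while allowing only one insertion rejects it; tightening or loosening the three budgets independently is what trades precision against recall.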

Patent
20 Jul 2018
TL;DR: In this paper, a sensor node identifier platform-based information management and analysis method is proposed to deal with the data information construction requirements of an industrial sensing network, where the sensor information is effectively managed and analyzed.
Abstract: The invention discloses a sensor node identifier platform-based information management and analysis method. The method comprises the following steps of 1) extracting sensor information data; 2) performing database construction on the extracted sensor information data; 3) matching the sensor information data with sensor node identifiers by applying a fuzzy matching algorithm; and 4) performing correlation analysis on sensor information by applying a data correlation analysis algorithm. The sensor information is effectively managed and analyzed, so that the data information construction requirements of an industrial sensing network can be met.

Patent
03 Jul 2018
TL;DR: In this article, a system, method, and computer program product embodiment for detecting duplicates with exact and fuzzy matching on encrypted match indexes using an encryption key in a cloud computing platform is presented.
Abstract: Disclosed herein are system, method, and computer program product embodiments for detecting duplicates with exact and fuzzy matching on encrypted match indexes using an encryption key in a cloud computing platform. An embodiment operates by determining a match rule index value upon reception of a new record. The embodiment encrypts the match rule index value using the customer's encryption key and a deterministic encryption method and stores the encrypted match rule index value. Duplicate detection may later be performed by using the same deterministic encryption method to determine a ciphertext for a candidate entry and comparing the ciphertext to the stored encrypted match indexes.
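
The reason a deterministic method is required can be shown with a short sketch: equal match index values must map to equal stored tokens so that they can be compared without decryption. The sketch below substitutes HMAC-SHA256 (a keyed hash, not an encryption scheme) for the patent's unspecified deterministic encryption, and the match rule shown (case and whitespace normalization) is a hypothetical example.

```python
import hashlib
import hmac


def match_rule_index_value(record: str) -> str:
    """Hypothetical match rule: lowercase and collapse whitespace."""
    return " ".join(record.lower().split())


def protected_index_value(record: str, key: bytes) -> str:
    """Deterministic keyed token: same key and same index value give the same token."""
    value = match_rule_index_value(record).encode("utf-8")
    return hmac.new(key, value, hashlib.sha256).hexdigest()


class DuplicateIndex:
    """Stores only protected tokens, never the plaintext index values."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._tokens: set[str] = set()

    def add(self, record: str) -> None:
        self._tokens.add(protected_index_value(record, self._key))

    def is_duplicate(self, candidate: str) -> bool:
        return protected_index_value(candidate, self._key) in self._tokens
```

A randomized encryption scheme would produce a different ciphertext on every call, so the equality comparison in `is_duplicate` would never succeed; that is the property the deterministic method provides.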

Patent
17 Aug 2018
TL;DR: In this paper, a power grid data association method based on an address matching technology is proposed, which achieves the aim that repair and complaint work orders without client numbers are associated to basic file information of clients through an address fuzzy matching technology.
Abstract: The invention relates to a power grid data association method based on an address matching technology. The method achieves the aim that repair and complaint work orders without client numbers are associated to basic file information of clients through an address fuzzy matching technology. The power grid data association method comprises the following steps: pre-processing is conducted; repair address information of the clients is received and stored as text information; structured address information, special characters and Arabic numbers are deleted; Chinese address character strings are subjected to editing distance calculation; a Chinese address character string of a user corresponding to a minimum calculation result is determined as a repair address of the user. Compared with the prior art, the power grid data association method has the advantages that a mode for achieving structured data association by using non-structured data association is provided, and fusion analysis of trans-disciplinary data is achieved; an address matching degree is calculated based on a minimum editing distance algorithm, adoption of Chinese word segmentation is avoided, and the misjudgment probability is lowered.
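
The matching steps in the abstract (clean the address text, compute edit distances, take the minimum) can be sketched as follows. The cleaning rule below removes only special characters and Arabic digits; removal of structured address information is omitted because the abstract does not say how it is identified. The function names and sample addresses are illustrative assumptions.

```python
import re


def clean_address(address: str) -> str:
    """Drop special characters, whitespace, underscores and Arabic digits."""
    return re.sub(r"[\W\d_]+", "", address)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def best_address(repair_address: str, client_addresses: list[str]) -> str:
    """Return the client address with the minimum edit distance to the query."""
    query = clean_address(repair_address)
    return min(client_addresses,
               key=lambda addr: edit_distance(query, clean_address(addr)))
```

Working character by character is what lets the method avoid Chinese word segmentation: `best_address("人民路小区3号楼", ["人民路小区", "解放路小区", "人民广场"])` selects `"人民路小区"`.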

Journal ArticleDOI
TL;DR: In this paper, the authors present an improved Turkish dataset built with an effective Hadoop-based spelling correction algorithm; the algorithm successfully detects and corrects spelling errors, and the dataset can be used as an open-source dataset in sentiment analysis studies.
Abstract: A public dataset, with a variety of properties suitable for sentiment analysis [1], event prediction, trend detection and other text mining applications, is needed in order to perform analysis studies successfully. The vast majority of data on social media is text-based, and machine learning cannot be applied directly to this raw data, since several different processes are required to prepare the data before the algorithms are run. For example, different misspellings of the same word enlarge the word vector space unnecessarily, which reduces the success of the algorithm and increases the computational power required. This paper presents an improved Turkish dataset with an effective spelling correction algorithm based on Hadoop [2]. The collected data is recorded on the Hadoop Distributed File System and the text-based data is processed with the MapReduce programming model. This method is suitable for the storage and processing of large text-based social media data. In this study, movie reviews have been automatically recorded with Apache ManifoldCF (MCF) [3] and data clusters have been created. Various methods, such as Levenshtein distance and fuzzy string matching, were compared to create a public dataset from the collected data. Experimental results show that the proposed algorithm successfully detects and corrects spelling errors, and that the resulting dataset can be used as an open-source dataset in sentiment analysis studies.

Patent
02 Oct 2018
TL;DR: In this article, a character identification method, device, equipment and readable storage medium are presented. The method comprises the following steps: preprocessing a gathered to-be-identified image and obtaining the characters in it; matching each character with templates in a pre-built fuzzy matching template database, thus obtaining an initial matching result and a matching rate for each character; and analyzing the structure characteristics of characters with a plurality of matching results and/or a low matching rate, thus obtaining their final matching result.
Abstract: The invention discloses a character identification method, device, equipment and a readable storage medium. The method comprises the following steps: preprocessing a gathered to-be-identified image, and obtaining characters in the to-be-identified image; respectively matching the characters in the to-be-identified image with templates in a pre-built fuzzy matching template database, thus obtaining an initial matching result and a matching rate of each character in the to-be-identified image; analyzing structure characteristics of characters with a plurality of matching results and/or low matching rate in the to-be-identified image, thus obtaining the final matching result of the characters with the plurality of matching results and/or low matching rate in the to-be-identified image. The character identification method, device, equipment and the readable storage medium can improve the character identification accuracy and efficiency.