
Showing papers on "String (computer science) published in 2014"


Journal ArticleDOI
TL;DR: A novel definition for string stability of nonlinear cascaded systems is proposed, using input-output properties, and is shown to result in well-known string stability conditions for linear cascaded systems.
Abstract: Nowadays, throughput has become a limiting factor in road transport. An effective means to increase the road throughput is to employ a small intervehicle time gap using automatic vehicle-following control systems. String stability, i.e., the disturbance attenuation along the vehicle string, is considered an essential requirement for the design of those systems. However, the formal notion of string stability is not unambiguous in literature, since both stability and performance interpretations exist. Therefore, a novel definition for string stability of nonlinear cascaded systems is proposed, using input-output properties. This definition is shown to result in well-known string stability conditions for linear cascaded systems. The theoretical results are experimentally validated using a platoon of six passenger vehicles equipped with cooperative adaptive cruise control.

549 citations
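For linear vehicle strings, the string stability condition discussed above reduces to requiring that the magnitude of the disturbance propagation transfer function not exceed one at any frequency. A minimal numeric sketch of that check (the transfer function and headway value below are illustrative assumptions, not taken from the paper):

```python
def string_stable_l2(num, den, omegas):
    """Check the classic linear string-stability condition |H(jw)| <= 1
    over a grid of frequencies, where H(s) = num(s)/den(s) is the
    disturbance propagation transfer function (coefficients given
    highest order first)."""
    def poly(coeffs, s):
        # Horner evaluation of the polynomial at complex s
        val = 0j
        for c in coeffs:
            val = val * s + c
        return val
    return all(abs(poly(num, 1j * w) / poly(den, 1j * w)) <= 1.0 + 1e-9
               for w in omegas)

# Illustrative spacing-policy example (assumed, not from the paper):
# H(s) = 1 / (h*s + 1) with time headway h = 0.5 s is low-pass with
# unit DC gain, hence |H(jw)| <= 1 for all frequencies.
omegas = [10 ** (k / 10) for k in range(-20, 21)]  # 0.01 .. 100 rad/s
stable = string_stable_l2([1.0], [0.5, 1.0], omegas)
```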


Journal ArticleDOI
TL;DR: A controller design method is developed that allows for explicit inclusion of the string stability requirement in the controller synthesis specifications, and L2 string-stable platooning strategies are obtained in both cases, revealing that the two-vehicle look-ahead topology is particularly effective at a larger communication delay.
Abstract: Cooperative adaptive cruise control (CACC) allows for short-distance automatic vehicle following using intervehicle wireless communication in addition to onboard sensors, thereby potentially improving road throughput. In order to fulfill performance, safety, and comfort requirements, a CACC-equipped vehicle platoon should be string stable, attenuating the effect of disturbances along the vehicle string. Therefore, a controller design method is developed that allows for explicit inclusion of the string stability requirement in the controller synthesis specifications. To this end, the notion of string stability is introduced first, and conditions for L2 string stability of linear systems are presented that motivate the development of an H∞ controller synthesis approach for string stability. The potential of this approach is illustrated by its application to the design of controllers for CACC for one- and two-vehicle look-ahead communication topologies. As a result, L2 string-stable platooning strategies are obtained in both cases, also revealing that the two-vehicle look-ahead topology is particularly effective at a larger communication delay. Finally, the results are experimentally validated using a platoon of three passenger vehicles, illustrating the practical feasibility of this approach.

400 citations


Journal ArticleDOI
TL;DR: This paper approaches the design of a CACC system from a Networked Control System (NCS) perspective and presents an NCS modeling framework that incorporates the effect of sampling, hold, and network delays that occur due to wireless communication and sampled-data implementation of the CACC controller over this wireless link.
Abstract: In this paper, we consider a Cooperative Adaptive Cruise Control (CACC) system, which regulates intervehicle distances in a vehicle string, for achieving improved traffic flow stability and throughput. Improved performance can be achieved by utilizing information exchange between vehicles through wireless communication in addition to local sensor measurements. However, wireless communication introduces network-induced imperfections, such as transmission delays, due to the limited bandwidth of the network and the fact that multiple nodes are sharing the same medium. Therefore, we approach the design of a CACC system from a Networked Control System (NCS) perspective and present an NCS modeling framework that incorporates the effect of sampling, hold, and network delays that occur due to wireless communication and sampled-data implementation of the CACC controller over this wireless link. Based on this network-aware modeling approach, we develop a technique to study the so-called string stability property of the string, in which vehicles are interconnected by a vehicle following control law and a constant time headway spacing policy. This analysis technique can be used to investigate tradeoffs between CACC performance (string stability) and network specifications (such as delays), which are essential in the multidisciplinary design of CACC controllers. Finally, we demonstrate the validity of the presented framework in practice by experiments performed with CACC-equipped prototype vehicles.

318 citations


Journal ArticleDOI
01 Apr 2014
TL;DR: This paper provides a comprehensive survey of a wide spectrum of existing string similarity join algorithms, classifies them into different categories based on their main techniques, and compares them through extensive experiments on a variety of real-world datasets with different characteristics.
Abstract: String similarity join is an important operation in data integration and cleansing that finds similar string pairs from two collections of strings. More than ten algorithms have been proposed to address this problem in the recent two decades. However, existing algorithms have not been thoroughly compared under the same experimental framework. For example, some algorithms are tested only on specific datasets. This makes it rather difficult for practitioners to decide which algorithms should be used for various scenarios. To address this problem, in this paper we provide a comprehensive survey on a wide spectrum of existing string similarity join algorithms, classify them into different categories based on their main techniques, and compare them through extensive experiments on a variety of real-world datasets with different characteristics. We also report comprehensive findings obtained from the experiments and provide new insights about the strengths and weaknesses of existing similarity join algorithms which can guide practitioners to select appropriate algorithms for various scenarios.

154 citations
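As a minimal illustration of what a string similarity join computes, here is a naive all-pairs join on q-gram Jaccard similarity (the threshold and helper names are hypothetical; the surveyed algorithms replace the quadratic loop with signature-based filtering):

```python
def qgrams(s, q=2):
    """Set of overlapping q-grams of a string."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def similarity_join(strings1, strings2, threshold=0.5, q=2):
    """Naive all-pairs string similarity join: return every pair whose
    q-gram Jaccard similarity reaches the threshold."""
    grams1 = [(s, qgrams(s, q)) for s in strings1]
    grams2 = [(s, qgrams(s, q)) for s in strings2]
    return [(s1, s2) for s1, g1 in grams1 for s2, g2 in grams2
            if jaccard(g1, g2) >= threshold]

pairs = similarity_join(["apple", "appel"], ["apple", "banana"], 0.5)
```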


Book ChapterDOI
18 Jul 2014
TL;DR: A set of algebraic techniques is presented for solving constraints over the theory of unbounded strings natively, without reduction to other problems; the techniques are implemented in the SMT solver cvc4, expanding its already large set of built-in theories with a theory of strings with concatenation, length, and membership in regular languages.
Abstract: An increasing number of applications in verification and security rely on or could benefit from automatic solvers that can check the satisfiability of constraints over a rich set of data types that includes character strings. Unfortunately, most string solvers today are standalone tools that can reason only about (some fragment of) the theory of strings and regular expressions, sometimes with strong restrictions on the expressiveness of their input language. These solvers are based on reductions to satisfiability problems over other data types, such as bit vectors, or to automata decision problems. We present a set of algebraic techniques for solving constraints over the theory of unbounded strings natively, without reduction to other problems. These techniques can be used to integrate string reasoning into general, multi-theory SMT solvers based on the DPLL(T) architecture. We have implemented them in our SMT solver cvc4 to expand its already large set of built-in theories to a theory of strings with concatenation, length, and membership in regular languages. Our initial experimental results show that, over pure string problems, cvc4 is highly competitive with specialized string solvers with a comparable input language.

148 citations


Proceedings ArticleDOI
03 Nov 2014
TL;DR: This work presents S3, a new symbolic string solver that employs a new algorithm for a constraint language that is expressive enough for widespread applicability and demonstrates both its robustness and its efficiency against the state-of-the-art.
Abstract: Motivated by the vulnerability analysis of web programs which work on string inputs, we present S3, a new symbolic string solver. Our solver employs a new algorithm for a constraint language that is expressive enough for widespread applicability. Specifically, our language covers all the main string operations, such as those in JavaScript. The algorithm first makes use of a symbolic representation so that membership in a set defined by a regular expression can be encoded as string equations. Secondly, there is a constraint-based generation of instances from these symbolic expressions so that the total number of instances can be limited. We evaluate S3 on a well-known set of practical benchmarks, demonstrating both its robustness (more definitive answers) and its efficiency (about 20 times faster) against the state-of-the-art.

133 citations


Patent
03 Jun 2014
TL;DR: In this article, a method for rewriting source text includes receiving source text including a source text string in a first natural language, translating it with a machine translation system into a second natural language, and automatically rewriting the source string in the first natural language to generate alternatives with higher translation confidence.
Abstract: A method for rewriting source text includes receiving source text including a source text string in a first natural language. The source text string is translated (S208) with a machine translation system to generate a first target text string in a second natural language. A translation confidence for the source text string is computed (S210), based on the first target text string. At least one alternative text string is generated (S216), where possible, in the first natural language by automatically rewriting the source string. Each alternative string is translated (S218) to generate a second target text string in the second natural language. A translation confidence is computed (S220) for the alternative text string based on the second target string. Based on the computed translation confidences, one of the alternative text strings may be selected as a candidate replacement for the source text string and may be proposed to a user on a graphical user interface.

131 citations


Journal ArticleDOI
TL;DR: This review provides a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies, and an example of analysis by Lempel-Ziv techniques from data compression.
Abstract: Modern sequencing and genome assembly technologies have provided a wealth of data, which will soon require an analysis by comparison for discovery. Sequence alignment, a fundamental task in bioinformatics research, may be used but with some caveats. Seminal techniques and methods from dynamic programming are proving ineffective for this work owing to their inherent computational expense when processing large amounts of sequence data. These methods are prone to giving misleading information because of genetic recombination, genetic shuffling and other inherent biological events. New approaches from information theory, frequency analysis and data compression are available and provide powerful alternatives to dynamic programming. These new methods are often preferred, as their algorithms are simpler and are not affected by synteny-related problems. In this review, we provide a detailed discussion of computational tools, which stem from alignment-free methods based on statistical analysis from word frequencies. We provide several clear examples to demonstrate applications and the interpretations over several different areas of alignment-free analysis such as base-base correlations, feature frequency profiles, compositional vectors, an improved string composition and the D2 statistic metric. Additionally, we provide detailed discussion and an example of analysis by Lempel-Ziv techniques from data compression.

126 citations
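The word-frequency methods discussed in the review can be illustrated with the basic D2 statistic, which is simply the inner product of two k-mer count vectors (a simplified sketch; practical tools use normalized variants of D2):

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all overlapping k-mers (words of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def d2_statistic(seq1, seq2, k=3):
    """Basic D2 statistic: the inner product of the two k-mer count
    vectors, i.e. the number of matching k-mer occurrence pairs."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    return sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())

score = d2_statistic("ACGTACGT", "ACGTTACG", k=3)
```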


Journal ArticleDOI
TL;DR: The results indicate that this approach is effective at finding relevant code, can be used on its own or to filter results from keyword searches to increase search precision, and is adaptable to find approximate matches and then guide modifications to match the user specifications when exact matches do not already exist.
Abstract: Programmers frequently search for source code to reuse using keyword searches. The search effectiveness in facilitating reuse, however, depends on the programmer's ability to specify a query that captures how the desired code may have been implemented. Further, the results often include many irrelevant matches that must be filtered manually. More semantic search approaches could address these limitations, yet existing approaches are either not flexible enough to find approximate matches or require the programmer to define complex specifications as queries. We propose a novel approach to semantic code search that addresses several of these limitations and is designed for queries that can be described using a concrete input/output example. In this approach, programmers write lightweight specifications as inputs and expected output examples. Unlike existing approaches to semantic search, we use an SMT solver to identify programs or program fragments in a repository, which have been automatically transformed into constraints using symbolic analysis, that match the programmer-provided specification. We instantiated and evaluated this approach in subsets of three languages, the Java String library, Yahooe Pipes mashup language, and SQL select statements, exploring its generality, utility, and trade-offs. The results indicate that this approach is effective at finding relevant code, can be used on its own or to filter results from keyword searches to increase search precision, and is adaptable to find approximate matches and then guide modifications to match the user specifications when exact matches do not already exist. These gains in precision and flexibility come at the cost of performance, for which underlying factors and mitigation strategies are identified.

126 citations


Journal ArticleDOI
TL;DR: A first-order linear phase correction model is proposed and it is demonstrated that optimal performance that minimizes asynchrony variance predicts a specific value for the correction gain.
Abstract: Control of relative timing is critical in ensemble music performance. We hypothesize that players respond to and correct asynchronies in tone onsets that arise from fluctuations in their individual...

117 citations


Journal ArticleDOI
TL;DR: In this article, the authors give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words and obtain an upper bound of $3n$ for the maximum sum of exponents of runs in a string of length $n$, improving on the best previously known bound.
Abstract: We give a new characterization of maximal repetitions (or runs) in strings based on Lyndon words. The characterization leads to a proof of what was known as the "runs" conjecture (Kolpakov & Kucherov (FOCS '99)), which states that the maximum number of runs $\rho(n)$ in a string of length $n$ is less than $n$. The proof is remarkably simple, considering the numerous endeavors to tackle this problem in the last 15 years, and significantly improves our understanding of how runs can occur in strings. In addition, we obtain an upper bound of $3n$ for the maximum sum of exponents $\sigma(n)$ of runs in a string of length $n$, improving on the best known bound of $4.1n$ by Crochemore et al. (JDA 2012), as well as other improved bounds on related problems. The characterization also gives rise to a new, conceptually simple linear-time algorithm for computing all the runs in a string. A notable characteristic of our algorithm is that, unlike all existing linear-time algorithms, it does not utilize the Lempel-Ziv factorization of the string. We also establish a relationship between runs and nodes of the Lyndon tree, which gives a simple optimal solution to the 2-Period Query problem that was recently solved by Kociumaka et al. (SODA 2015).
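For intuition, a run is a maximal interval whose smallest period p fits at least twice. A brute-force enumerator (quadratic, purely illustrative; the paper's algorithm is linear time and based on Lyndon words) can check the runs theorem rho(n) < n on small strings:

```python
def runs(s):
    """Brute-force enumeration of maximal repetitions (runs): maximal
    intervals [a, b) whose smallest period p satisfies b - a >= 2p.
    Suitable only for small strings."""
    n = len(s)
    found = set()
    for i in range(n):
        for j in range(i + 2, n + 1):
            sub = s[i:j]
            # smallest period of s[i:j]
            p = next(p for p in range(1, len(sub) + 1)
                     if all(sub[k] == sub[k - p] for k in range(p, len(sub))))
            if 2 * p <= len(sub):
                # extend maximally left and right with the same period
                a, b = i, j
                while a > 0 and s[a - 1] == s[a - 1 + p]:
                    a -= 1
                while b < n and s[b] == s[b - p]:
                    b += 1
                found.add((a, b, p))
    return sorted(found)

rs = runs("mississippi")
```

On "mississippi" this finds the three squares "ss", "ss", "pp" and the period-3 run "ississi", so the run count stays below the string length, as the theorem guarantees.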

Journal ArticleDOI
TL;DR: This research proposes a similarity search over malware to detect malware variants using novel distance metrics: a pre-filtering metric based on the distance between feature vectors of string-based signatures, and a minimum matching distance; the metrics are implemented in a complete malware variant detection system.
Abstract: Static detection of malware variants plays an important role in system security and control flow has been shown as an effective characteristic that represents polymorphic malware. In our research, we propose a similarity search of malware to detect these variants using novel distance metrics. We describe a malware signature by the set of control flowgraphs the malware contains. We use a distance metric based on the distance between feature vectors of string-based signatures. The feature vector is a decomposition of the set of graphs into either fixed size k-subgraphs, or q-gram strings of the high-level source after decompilation. We use this distance metric to perform pre-filtering. We also propose a more effective but less computationally efficient distance metric based on the minimum matching distance. The minimum matching distance uses the string edit distances between programs’ decompiled flowgraphs, and the linear sum assignment problem to construct a minimum sum weight matching between two sets of graphs. We implement the distance metrics in a complete malware variant detection system. The evaluation shows that our approach is highly effective in terms of a limited false positive rate and our system detects more malware variants when compared to the detection rates of other algorithms.
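The two ingredients described above, string edit distance between decompiled flowgraph signatures and a minimum-sum-weight matching between two sets of graphs, can be sketched as follows (brute-force matching stands in for the linear sum assignment solver; the signature strings are hypothetical):

```python
from itertools import permutations

def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def minimum_matching_distance(graphs1, graphs2):
    """Minimum-sum-weight matching between two equal-size sets of
    flowgraph signature strings (brute force over permutations)."""
    return min(sum(edit_distance(g1, g2) for g1, g2 in zip(graphs1, p))
               for p in permutations(graphs2))

d = minimum_matching_distance(["pushcall", "loopadd"], ["pushcal", "loopsub"])
```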

Patent
John F. Sheets1, Kim Wagner1
29 Jan 2014
TL;DR: In this article, the authors propose a speaker verification system that allows a captured voice sample attempting to reproduce a word string having a random element to authenticate the user. Authentication is based on both a match score, indicating how closely the captured voice sample matches previously stored voice samples of the user, and a pass or fail response indicating whether the voice sample is an accurate reproduction of the word string.
Abstract: Embodiments of the invention provide for speaker verification on a communication device without requiring a user to go through a formal registration process with the issuer or network. Certain embodiments allow the use of a captured voice sample attempting to reproduce a word string having a random element to authenticate the user. Authentication of the user is based on both a match score indicating how closely the captured voice samples match to previously stored voice samples of the user and a pass or fail response indicating whether the voice sample is an accurate reproduction of the word string. The processing network maintains a history of the authenticated transactions and voice samples.

Patent
11 Mar 2014
TL;DR: In this paper, computer-implemented systems and methods are disclosed for constructing a parser that parses complex data, including a method for receiving a parser definition as an input to a parser generator and generating a parser at least in part from the parser definition.
Abstract: Computer-implemented systems and methods are disclosed for constructing a parser that parses complex data. In some embodiments, a method is provided for receiving a parser definition as an input to a parser generator and generating a parser at least in part from the parser definition. In some embodiments, the generated parser comprises two or more handlers forming a processing pipeline. In some embodiments, the parser receives as input a first string into the processing pipeline. In some embodiments, the parser generates a second string by a first handler and inputs the second string regeneratively into the parsing pipeline, if the first string matches an expression specified for the first handler in the parser definition.

Patent
20 Jan 2014
TL;DR: In this article, source phrase selection is based on a translatability score and optionally on fluency and semantic relatedness scores, and a set of candidate phrases is proposed (S114) for display on the authoring interface, each of the candidate phrases being the suffix of a respective one of the selected source phrases.
Abstract: An authoring method includes generating an authoring interface configured for assisting a user to author a text string in a source language for translation to a target string in a target language. Initial source text entered (S110) by the user is received through the authoring interface. Source phrases are selected (S112) that each include at least one token of the initial source text as a prefix and at least one other token as a suffix. The source phrase selection is based on a translatability score and optionally on fluency and semantic relatedness scores. A set of candidate phrases is proposed (S114) for display on the authoring interface, each of the candidate phrases being the suffix of a respective one of the selected source phrases. The user may select (S116) one of the candidate phrases, which is appended to the source text following its corresponding prefix, or may enter alternative text. The process may be repeated until the user is satisfied with the source text and the SMT model can then be used for its translation.

Journal ArticleDOI
TL;DR: In this paper, the authors show how heterogeneous bidirectional vehicle strings can be modelled as port-Hamiltonian systems, and propose a control law to guarantee string stability with respect to bounded disturbances.

Book ChapterDOI
18 Jul 2014
TL;DR: A prototypical implementation of the decision procedure has been developed and integrated into a CEGAR-based model checker for the analysis of programs encoded as Horn clauses; the resulting tool is able to automatically establish the correctness of several programs that are beyond the reach of existing methods.
Abstract: We present a decision procedure for a logic that combines (i) word equations over string variables denoting words of arbitrary lengths, together with (ii) constraints on the length of words, and (iii) constraints on the regular languages to which words belong. Decidability of this general logic is still open. Our procedure is sound for the general logic, and a decision procedure for a particularly rich fragment that restricts the form in which word equations are written. In contrast to many existing procedures, our method does not make assumptions about the maximum length of words. We have developed a prototypical implementation of our decision procedure, and integrated it into a CEGAR-based model checker for the analysis of programs encoded as Horn clauses. Our tool is able to automatically establish the correctness of several programs that are beyond the reach of existing methods.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: This work shows that there exist commitment, zero-knowledge and general function evaluation protocols with universally composable security, in a model where all parties and all protocols have access to a single, global, random oracle and no other trusted setup.
Abstract: Contrary to prior belief, we show that there exist commitment, zero-knowledge and general function evaluation protocols with universally composable security, in a model where all parties and all protocols have access to a single, global, random oracle and no other trusted setup. This model provides significantly stronger composable security guarantees than the traditional random oracle model of Bellare and Rogaway [CCS'93] or even the common reference string model. Indeed, these latter models provide no security guarantees in the presence of arbitrary protocols that use the same random oracle (or reference string or hash function). Furthermore, our protocols are highly efficient. Specifically, in the interactive setting, our commitment and general computation protocols are much more efficient than the best known ones due to Lindell [Crypto'11,'13] which are secure in the common reference string model. In the non-interactive setting, our protocols are slightly less efficient than the best known ones presented by Afshar et al. [Eurocrypt '14] but do away with the need to rely on a non-global (programmable) reference string.

Proceedings ArticleDOI
01 Oct 2014
TL;DR: An approach that uses character n-grams as features is proposed for the task of native language identification and has an important advantage in that it is language independent and linguistic theory neutral.
Abstract: A common approach in text mining tasks such as text categorization, authorship identification or plagiarism detection is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. In this work, an approach that uses character n-grams as features is proposed for the task of native language identification. Instead of doing standard feature selection, the proposed approach combines several string kernels using multiple kernel learning. Kernel Ridge Regression and Kernel Discriminant Analysis are independently used in the learning stage. The empirical results obtained in all the experiments conducted in this work indicate that the proposed approach achieves state of the art performance in native language identification, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the proposed approach has an important advantage in that it is language independent and linguistic theory neutral. In the cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state of the art system by 32.3%.
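The character n-gram features feed a string kernel; the simplest variant, often called the spectrum kernel, is the inner product of n-gram count vectors. A minimal sketch (the paper combines several such kernels via multiple kernel learning, which is not shown here):

```python
from collections import Counter

def spectrum_kernel(x, y, n=2):
    """Character n-gram (spectrum) kernel: inner product of the n-gram
    count vectors of the two strings."""
    cx = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    cy = Counter(y[i:i + n] for i in range(len(y) - n + 1))
    return sum(cx[g] * cy[g] for g in cx.keys() & cy.keys())

def kernel_matrix(texts, n=2):
    """Gram matrix that learners such as Kernel Ridge Regression or
    Kernel Discriminant Analysis consume."""
    return [[spectrum_kernel(a, b, n) for b in texts] for a in texts]

K = kernel_matrix(["banana", "bandana"], n=2)
```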

Book ChapterDOI
26 Mar 2014
TL;DR: Groth-Sahai proofs are efficient non-interactive zero-knowledge proofs that have found widespread use in pairing-based cryptography, as mentioned in this paper; efficiency improvements are proposed in the SXDH setting, which is the one that yields the most efficient non-interactive zero-knowledge proofs.
Abstract: Groth-Sahai proofs are efficient non-interactive zero-knowledge proofs that have found widespread use in pairing-based cryptography. We propose efficiency improvements of Groth-Sahai proofs in the SXDH setting, which is the one that yields the most efficient non-interactive zero-knowledge proofs. We replace some of the commitments with ElGamal encryptions, which reduces the prover's computation and for some types of equations reduces the proof size. Groth-Sahai proofs are zero-knowledge when no public elements are paired to each other. We observe that they are also zero-knowledge when base elements for the groups are paired to public constants. The prover's computation can be reduced by letting her pick her own common reference string. By giving a proof that she has picked a valid common reference string, soundness is not compromised. We define a type-based commit-and-prove scheme, which allows commitments to be reused in many different proofs.

Journal ArticleDOI
01 Feb 2014
TL;DR: Based on the presented techniques, Stranger, an automata-based string analysis tool for detecting string-related security vulnerabilities in PHP applications is implemented and able to detect known/unknown vulnerabilities, and prove the absence of vulnerabilities with respect to given attack patterns.
Abstract: Verifying string manipulating programs is a crucial problem in computer security. String operations are used extensively within web applications to manipulate user input, and their erroneous use is the most common cause of security vulnerabilities in web applications. We present an automata-based approach for symbolic analysis of string manipulating programs. We use deterministic finite automata (DFAs) to represent possible values of string variables. Using forward reachability analysis we compute an over-approximation of all possible values that string variables can take at each program point. Intersecting these with a given attack pattern yields the potential attack strings if the program is vulnerable. Based on the presented techniques, we have implemented Stranger, an automata-based string analysis tool for detecting string-related security vulnerabilities in PHP applications. We evaluated Stranger on several open-source Web applications including one with 350,000+ lines of code. Stranger is able to detect known/unknown vulnerabilities, and, after inserting proper sanitization routines, prove the absence of vulnerabilities with respect to given attack patterns.

Proceedings ArticleDOI
18 Jun 2014
TL;DR: This work proposes a novel pivotal prefix filter which significantly reduces the number of signatures and develops a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query.
Abstract: We study the string similarity search problem with edit-distance constraints, which, given a set of data strings and a query string, finds the similar strings to the query. Existing algorithms use a signature-based framework. They first generate signatures for each string and then prune the dissimilar strings which have no common signatures to the query. However existing methods involve large numbers of signatures and many signatures are unnecessary. Reducing the number of signatures not only increases the pruning power but also decreases the filtering cost. To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. We prove the pivotal filter achieves larger pruning power and less filtering cost than state-of-the-art filters. We develop a dynamic programming method to select high-quality pivotal prefix signatures to prune dissimilar strings with non-consecutive errors to the query. We propose an alignment filter that considers the alignments between signatures to prune large numbers of dissimilar pairs with consecutive errors to the query. Experimental results on three real datasets show that our method achieves high performance and outperforms the state-of-the-art methods by an order of magnitude.
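The prefix-filtering idea underlying such signature schemes can be sketched as follows: one edit operation destroys at most q overlapping q-grams, so two strings within edit distance tau must share a q-gram among the first q*tau + 1 grams under a fixed global order (plain sorted order below; the paper's pivotal prefix selects far fewer signatures than this minimal sketch):

```python
def qgram_list(s, q=2):
    """All overlapping q-grams of a string, in a fixed global order
    (plain lexicographic order here)."""
    return sorted(s[i:i + q] for i in range(len(s) - q + 1))

def prefix_signatures(s, tau, q=2):
    """Prefix filter signatures for edit distance tau: the first
    q*tau + 1 grams under the global order."""
    return set(qgram_list(s, q)[:q * tau + 1])

def candidates(query, data, tau, q=2):
    """Keep only strings sharing at least one prefix signature with
    the query; survivors still need verification by edit distance."""
    sig = prefix_signatures(query, tau, q)
    return [s for s in data if prefix_signatures(s, tau, q) & sig]

cands = candidates("string", ["strong", "spring", "xylophone"], tau=1)
```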

Journal ArticleDOI
TL;DR: This paper considers the string guessing problem as a generic online problem and shows a lower bound on the number of advice bits needed to obtain a good solution, and uses special reductions from string guessing to improve the best known lower bound for the online set cover problem and to give a lower bound on the advice complexity of the online maximum clique problem.

Patent
05 Dec 2014
TL;DR: In this article, the authors present a method, apparatus, and program means for performing a string comparison operation, in which, in response to a first instruction, execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.
Abstract: Method, apparatus, and program means for performing a string comparison operation. In one embodiment, an apparatus includes execution resources to execute a first instruction. In response to the first instruction, said execution resources store a result of a comparison between each data element of a first and second operand corresponding to a first and second text string, respectively.

Proceedings ArticleDOI
03 Nov 2014
TL;DR: This paper presents a procedure for discovering DGAs from Domain Name Service (DNS) query data that works by identifying client IP addresses with an unusual distribution of second-level string lengths in the domain names that they query.
Abstract: In order to detect malware that uses domain fluxing to circumvent blacklisting, it is useful to be able to discover new domain-generation algorithms (DGAs) that are being used to generate algorithmically-generated domains (AGDs). This paper presents a procedure for discovering DGAs from Domain Name Service (DNS) query data. It works by identifying client IP addresses with an unusual distribution of second-level string lengths in the domain names that they query. Running this fairly simple procedure on 5 days' data from a large enterprise network uncovered 19 different DGAs, nine of which have not been identified as previously-known. Samples and statistical information about the DGA domains are given.
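The core signal, an unusual distribution of second-level label lengths per client, can be sketched with a simple Shannon entropy over length counts (the domains below are fabricated examples, and the paper's statistical procedure is more involved than this sketch):

```python
from collections import Counter
from math import log2

def sld_length(domain):
    """Length of the label just left of the TLD; a naive stand-in for
    the second-level label (real deployments consult a public-suffix
    list)."""
    labels = domain.rstrip(".").split(".")
    return len(labels[-2]) if len(labels) >= 2 else len(labels[0])

def length_entropy(domains):
    """Shannon entropy of the second-level length distribution for one
    client; DGA-driven clients tend to show long, near-uniform lengths."""
    counts = Counter(sld_length(d) for d in domains)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

benign = ["google.com", "github.com", "example.org", "google.com"]
suspect = ["kq3v9z7x1a4.net", "zzq8wj2.net", "a9f3kk2m1x8b7.com",
           "p0o9i8u7y6.org"]
```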

Journal ArticleDOI
TL;DR: This work describes a new data structure based on relative Lempel–Ziv compression that is space-efficient and also supports fast pattern searching.

Patent
20 May 2014
TL;DR: In this article, a system and method for computing confidence in an output of a text recognition system includes performing character recognition on an input text image with a text-recognition system to generate a candidate string of characters.
Abstract: A system and method for computing confidence in an output of a text recognition system includes performing character recognition on an input text image with a text recognition system to generate a candidate string of characters. A first representation is generated, based on the candidate string of characters, and a second representation is generated based on the input text image. A confidence in the candidate string of characters is computed based on a computed similarity between the first and second representations in a common embedding space.
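The confidence step boils down to a similarity between two vectors in a common embedding space. The sketch below uses cosine similarity as one plausible choice (the patent does not commit to a specific similarity function here, and the example vectors are purely illustrative — in practice both representations come from learned projections of the candidate string and the text image).

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Illustrative embeddings of the candidate character string and the
# input text image after projection into the common space.
string_vec = [0.9, 0.1, 0.3]
image_vec = [0.8, 0.2, 0.35]
confidence = cosine_similarity(string_vec, image_vec)
```

A high score indicates the recognized string and the image agree in the shared space; a low score can trigger rejection or a second-pass recognizer.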

Patent
11 Feb 2014
TL;DR: In this article, a photovoltaic system with at least one string (20) of solar modules (30) and a system (12) for disconnecting individual solar modules and safely connecting or reconnecting the disconnected solar modules to the string is described.
Abstract: The invention relates to a photovoltaic system (1) with at least one string (20) of solar modules (30) and a system (12) for disconnecting individual solar modules and safely connecting or reconnecting the disconnected solar modules to the string. The solar junction boxes (12) are at least partially "intelligent" and have a safety circuit (13) which defines an operating state and a safe state. The solar junction boxes are switched from the safe state to the operating state, i.e. are activated, by injecting a starting current (76).

Journal ArticleDOI
TL;DR: A new similarity function is proposed, called fuzzy-token-matching-based similarity which extends token- based similarity functions by allowing fuzzy match between two tokens and achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.
Abstract: String similarity join that finds similar string pairs between two string sets is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this article, we propose a new similarity function, called fuzzy-token-matching-based similarity, which extends token-based similarity functions (e.g., Jaccard similarity and cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity function and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. We also extend our techniques to support weighted tokens. Experimental results show that our method achieves high efficiency and result quality and significantly outperforms state-of-the-art approaches.

Proceedings ArticleDOI
09 Jun 2014
TL;DR: This work presents a new approach to model counting for structured data types, specifically strings, that can model count for constraints specified in an expressive string language efficiently and precisely, thereby outperforming previous finite-size analysis tools.
Abstract: Model counting is the problem of determining the number of solutions that satisfy a given set of constraints. Model counting has numerous applications in the quantitative analyses of program execution time, information flow, combinatorial circuit designs as well as probabilistic reasoning. We present a new approach to model counting for structured data types, specifically strings in this work. The key ingredient is a new technique that leverages generating functions as a basic primitive for combinatorial counting. Our tool SMC which embodies this approach can model count for constraints specified in an expressive string language efficiently and precisely, thereby outperforming previous finite-size analysis tools. SMC is expressive enough to model constraints arising in real-world JavaScript applications and UNIX C utilities. We demonstrate the practical feasibility of performing quantitative analyses arising in security applications, such as determining the comparative strengths of password strength meters and determining the information leakage via side channels.
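The generating-function idea can be illustrated in miniature: represent a string language by a sequence whose n-th entry counts its strings of length n, and combine languages by combining these sequences. This toy sketch assumes unambiguous union and concatenation and truncates at a maximum length; SMC's actual primitives over its constraint language are far richer, and the function names here are illustrative.

```python
def concat(g1, g2, max_len: int):
    """Generating-function product: entry n counts length-n strings in
    the concatenation of two (unambiguously concatenable) languages."""
    out = [0] * (max_len + 1)
    for i, a in enumerate(g1):
        for j, b in enumerate(g2):
            if i + j <= max_len:
                out[i + j] += a * b
    return out


def union(g1, g2, max_len: int):
    """Coefficient-wise sum, valid for a disjoint union of languages."""
    return [(g1[n] if n < len(g1) else 0) + (g2[n] if n < len(g2) else 0)
            for n in range(max_len + 1)]


# [a-z] as a language: 26 strings, all of length 1.
letter = [0, 26]
# Model count for two-letter lowercase strings: coefficient of x^2.
two_letters = concat(letter, letter, 4)
```

Counting passwords matching a policy then reduces to reading off a coefficient (or a sum of coefficients) rather than enumerating solutions one by one.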