
Showing papers on "String (computer science) published in 2013"


Proceedings ArticleDOI
18 Aug 2013
TL;DR: A general purpose string solver, called Z3-str, is developed as an extension of the Z3 SMT solver through its plug-in interface; it treats strings as a primitive type, thus avoiding the inherent limitations observed in many existing solvers that encode strings in terms of other primitives.
Abstract: Analyzing web applications requires reasoning about strings and non-strings cohesively. Existing string solvers either ignore non-string program behavior or support a limited set of string operations. In this paper, we develop a general purpose string solver, called Z3-str, as an extension of the Z3 SMT solver through its plug-in interface. Z3-str treats strings as a primitive type, thus avoiding the inherent limitations observed in many existing solvers that encode strings in terms of other primitives. The logic of the plug-in has three sorts, namely, bool, int and string. The string-sorted terms include string constants and variables of arbitrary length, with functions such as concatenation, sub-string, and replace. The int-sorted terms are standard, with the exception of the length function over string terms. The atomic formulas are equations over string terms, and (in)equalities over integer terms. Not only does our solver have features that enable whole-program symbolic, static and dynamic analysis, but it also performs better than other solvers in our experiments. The application of Z3-str in remote code execution detection shows that its support of a wide spectrum of string operations is key to reducing false positives.

205 citations
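To make the kind of question a string solver decides concrete, here is a deliberately naive, stdlib-only sketch: a real solver like Z3-str reasons symbolically over strings of arbitrary length, whereas this toy simply enumerates bounded candidates. The `solve_string_constraint` helper and the example constraint are illustrative inventions, not part of Z3-str.

```python
from itertools import product

def solve_string_constraint(predicate, alphabet="ab", max_len=3):
    """Brute-force search for a string satisfying `predicate`.

    A toy stand-in for what a string solver decides symbolically:
    here we simply enumerate all candidates up to `max_len`.
    """
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if predicate(s):
                return s
    return None  # unsatisfiable within the length bound

# Constraint: len(x) == 2 and the concatenation x + "b" contains "ab"
witness = solve_string_constraint(lambda x: len(x) == 2 and "ab" in x + "b")
assert witness is not None and len(witness) == 2 and "ab" in witness + "b"
```

The brute-force search degrades exponentially with length and alphabet size, which is precisely why dedicated decision procedures over a string sort, as in the paper, matter.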


Book ChapterDOI
21 Oct 2013
TL;DR: It is shown that if optimal string similarity metrics are chosen, those alone can produce alignments that are competitive with the state of the art in ontology alignment systems.
Abstract: Ontology alignment is an important part of enabling the semantic web to reach its full potential. The vast majority of ontology alignment systems use one or more string similarity metrics, but often the choice of which metrics to use is not given much attention. In this work we evaluate a wide range of such metrics, along with string pre-processing strategies such as removing stop words and considering synonyms, on different types of ontologies. We also present a set of guidelines on when to use which metric. We furthermore show that if optimal string similarity metrics are chosen, those alone can produce alignments that are competitive with the state of the art in ontology alignment systems. Finally, we examine the improvements possible to an existing ontology alignment system using an automated string metric selection strategy based upon the characteristics of the ontologies to be aligned.

183 citations
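Two of the most common string similarity metrics compared in studies like this are edit distance and q-gram overlap. A compact reference implementation (plain Python, not tied to any particular ontology alignment system) might look like:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def qgram_jaccard(a, b, q=2):
    """Jaccard similarity over the sets of q-grams of the two strings."""
    A = {a[i:i + q] for i in range(len(a) - q + 1)}
    B = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(A & B) / len(A | B) if A | B else 1.0

# "kitten" -> "sitting" needs 3 edits; "night"/"nacht" share one bigram of seven
assert levenshtein("kitten", "sitting") == 3
assert abs(qgram_jaccard("night", "nacht") - 1 / 7) < 1e-9
```

As the paper observes, which of these families performs best depends strongly on the ontologies being aligned, so the metric choice deserves explicit evaluation.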


Patent
Kevin A. Gibbs1
24 Sep 2013
TL;DR: In this paper, a set of ordered predicted completion strings including strings of ideographs are presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar).
Abstract: A set of ordered predicted completion strings including strings of ideographs are presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar). The user-entered text may include zero or more ideographs followed by one or more phonetic characters, or the entered text may be one or more ideographs. The predicted completion strings can be in the form of URLs or query strings. The ordering may be based on any number of factors (e.g., a query's frequency of submission from a community of users). URLs can be ranked based on an importance value of the URL. The sets of ordered predicted completion strings are obtained by matching a fingerprint value of the user's entry string to a fingerprint-to-table map which contains the set of ordered predicted completion strings. The generation of the ordered prediction strings takes into account multiple phonetic representations of certain strings of ideographs.

169 citations


Journal ArticleDOI
TL;DR: An integrated technique for simultaneous reaction path and exact transition state search is described by implementing an eigenvector following optimization algorithm in internal coordinates with Hessian update techniques.
Abstract: The growing string method (GSM) is highly useful for locating reaction paths connecting two molecular intermediates. GSM has often been used in a two-step procedure to locate exact transition states (TS), where GSM creates a quality initial structure for a local TS search. This procedure and others like it, however, do not always converge to the desired transition state because the local search is sensitive to the quality of the initial guess. This article describes an integrated technique for simultaneous reaction path and exact transition state search. This is achieved by implementing an eigenvector following optimization algorithm in internal coordinates with Hessian update techniques. After partial convergence of the string, an exact saddle point search begins under the constraint that the maximized eigenmode of the TS node Hessian has significant overlap with the string tangent near the TS. Subsequent optimization maintains connectivity of the string to the TS as well as locks in the TS direction, al...

153 citations


Book ChapterDOI
Charanjit S. Jutla1, Arnab Roy2
01 Dec 2013
TL;DR: A novel notion of quasi-adaptive non-interactive zero-knowledge (NIZK) proofs for probability distributions on parametrized languages is defined, and it is shown that the system can be extended to include integer tags in the defining equations, where the tags are provided adaptively by the adversary.
Abstract: We define a novel notion of quasi-adaptive non-interactive zero-knowledge (NIZK) proofs for probability distributions on parametrized languages. It is quasi-adaptive in the sense that the common reference string (CRS) generator can generate the CRS depending on the language parameters. However, the simulation is required to be uniform, i.e., a single efficient simulator should work for the whole class of parametrized languages. For distributions on languages that are linear subspaces of vector spaces over bilinear groups, we give quasi-adaptive computationally sound NIZKs that are shorter and more efficient than Groth-Sahai NIZKs. For many cryptographic applications quasi-adaptive NIZKs suffice, and our constructions can lead to significant improvements in the standard model. Our construction can be based on any k-linear assumption, and in particular under the eXternal Diffie-Hellman (XDH) assumption our proofs are even competitive with Random-Oracle-based Σ-protocol NIZK proofs. We also show that our system can be extended to include integer tags in the defining equations, where the tags are provided adaptively by the adversary. This leads to applicability of our system to many applications that use tags, e.g. applications using Cramer-Shoup projective hash proofs. Our techniques also lead to the shortest known ciphertext fully secure identity-based encryption (IBE) scheme under standard static assumptions (SXDH). Further, we also get a short publicly-verifiable CCA2-secure IBE scheme.

146 citations


Proceedings ArticleDOI
18 May 2013
TL;DR: A topic modeling analysis that combines question concepts, types, and code is presented to associate programming concepts and identifiers with particular types of questions, such as, “how to perform encoding”.
Abstract: Questions from Stack Overflow provide a unique opportunity to gain insight into what programming concepts are the most confusing. We present a topic modeling analysis that combines question concepts, types, and code. Using topic modeling, we are able to associate programming concepts and identifiers (like the String class) with particular types of questions, such as, “how to perform encoding”.

135 citations


Journal ArticleDOI
TL;DR: A detailed description of the generation of internal coordinates suitable for use in GSM as reactive tangents and in string optimization is given, and a climbing image scheme is included to improve the quality of the transition state approximation, ensuring high reliability of the method.
Abstract: The growing string method (GSM) has proven especially useful for locating chemical reaction paths at low computational cost. While many string methods use Cartesian coordinates, these methods can be substantially improved by changes in the coordinate system used for interpolation and optimization steps. The quality of the interpolation scheme is especially important because it determines how close the initial path is to the optimized reaction path, and this strongly affects the rate of convergence. In this article, a detailed description of the generation of internal coordinates (ICs) suitable for use in GSM as reactive tangents and in string optimization is given. Convergence of reaction paths is smooth because the IC tangent and orthogonal directions are better representations of chemical bonding compared to Cartesian coordinates. This is not only important quantitatively for reducing computational cost but also allows reaction paths to be described with smoothly varying chemically relevant coordinates. Benchmark computations with challenging reactions are compared to previous versions of GSM and show significant speedups. Finally, a climbing image scheme is included to improve the quality of the transition state approximation, ensuring high reliability of the method.

133 citations


Journal ArticleDOI
TL;DR: This work presents a methodology for computing the Burrows-Wheeler transform (BWT) of a string collection in a lightweight fashion, and gives two algorithms for recovering the strings in a collection from its BWT.

104 citations
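The transform itself is easy to state; the paper's contribution is computing it for large string collections in a lightweight fashion. For orientation only, here is a textbook single-string BWT with its inverse, assuming the sentinel character `$` does not occur in the input:

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations of s + sentinel."""
    s += "$"  # unique end-of-string sentinel, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(t):
    """Recover the original string by repeated sort-and-prepend."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(c + row for c, row in zip(t, table))
    return next(row for row in table if row.endswith("$"))[:-1]

assert bwt("banana") == "annb$aa"
assert inverse_bwt(bwt("banana")) == "banana"
```

This quadratic-space version is exactly what lightweight algorithms like the paper's avoid: they never materialize the rotation table, which is what makes BWT construction feasible for genome-scale collections.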


Journal ArticleDOI
TL;DR: The authors argue for audacious pedagogies of speculative fabulation, contending that the kinds of pedagogical endeavours that times of uncertainty call for are by no means straightforward and demand more venturesome approaches informed by speculative posthuman inquiries and exploratory new materialisms, as argued along with Elizabeth de Freitas (2020) writing in this issue.
Abstract: In resistance to capitalist logics of speculation, this article argues for audacious pedagogies of speculative fabulation. The kinds of pedagogical endeavours that times of uncertainty call for are by no means straightforward, calling, as I argue along with Elizabeth de Freitas (2020) writing in this issue, for more venturesome approaches informed by speculative posthuman inquiries and exploratory new materialisms. The Anthropocene and the Capitalocene are terms that capture the equivocal nature of the crisis-riven present. Laden with contradiction and destruction, these descriptors also embody strange afterlives. Beyond problematic present/futures produced by humans only for themselves lie intimate and uncanny sympoiesis, world-buildings and meaning-makings with non-human others and more-than-human processes. In accounting for these as well as for the already entangled material conditions of our time, pedagogy needs to pay attention to the slippery nature of cognition itself; a task to which the genre of science fiction, or speculative fabulation (SF), is primed.

94 citations


Journal ArticleDOI
TL;DR: A novel algorithm to detect text information from natural scene images that achieves the state-of-the-art performance on scene text classification and detection, and significantly outperforms the existing algorithms for character identification.

94 citations


Journal ArticleDOI
TL;DR: This letter describes a chain-of-states method that optimizes reaction paths under the sole constraint of equally spaced structures that requires no spring forces, interpolation algorithms, or other heuristics to control structure distribution.
Abstract: This letter describes a chain-of-states method that optimizes reaction paths under the sole constraint of equally spaced structures. In contrast to NEB and string methods, it requires no spring forces, interpolation algorithms, or other heuristics to control structure distribution. Rigorous use of a quadratic PES allows calculation of an optimization step with a predefined distribution in Cartesian space. The method is a formal extension of single-structure quasi-Newton methods. An initial guess can be evolved, as in the growing string method.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: An approach in which a natural language model is incorporated into a search-based input data generation process with the aim of improving the human readability of generated strings is presented.
Abstract: The frequent non-availability of an automated oracle means that, in practice, checking software behaviour is frequently a painstakingly manual task. Despite the high cost of human oracle involvement, there has been little research investigating how to make the role easier and less time-consuming. One source of human oracle cost is the inherent unreadability of machine-generated test inputs. In particular, automatically generated string inputs tend to be arbitrary sequences of characters that are awkward to read. This makes test cases hard to comprehend and time-consuming to check. In this paper we present an approach in which a natural language model is incorporated into a search-based input data generation process with the aim of improving the human readability of generated strings. We further present a human study of test inputs generated using the technique on 17 open source Java case studies. For 10 of the case studies, the participants recorded significantly faster times when evaluating inputs produced using the language model, with medium to large effect sizes 60% of the time. In addition, the study found that accuracy of test input evaluation was also significantly improved for 3 of the case studies.

Journal ArticleDOI
TL;DR: A forward-backward lattice pruning algorithm is proposed to reduce the computation in training when trigram language models are used, and beam search techniques are investigated to accelerate the decoding speed.
Abstract: This paper proposes a method for handwritten Chinese/Japanese text (character string) recognition based on semi-Markov conditional random fields (semi-CRFs). The high-order semi-CRF model is defined on a lattice containing all possible segmentation-recognition hypotheses of a string to elegantly fuse the scores of candidate character recognition and the compatibilities of geometric and linguistic contexts by representing them in the feature functions. Based on given models of character recognition and compatibilities, the fusion parameters are optimized by minimizing the negative log-likelihood loss with a margin term on a training string sample set. A forward-backward lattice pruning algorithm is proposed to reduce the computation in training when trigram language models are used, and beam search techniques are investigated to accelerate the decoding speed. We evaluate the performance of the proposed method on unconstrained online handwritten text lines of three databases. On the test sets of databases CASIA-OLHWDB (Chinese) and TUAT Kondate (Japanese), the character level correct rates are 95.20 and 95.44 percent, and the accurate rates are 94.54 and 94.55 percent, respectively. On the test set (online handwritten texts) of ICDAR 2011 Chinese handwriting recognition competition, the proposed method outperforms the best system in competition.

Book ChapterDOI
02 Sep 2013
TL;DR: Succinct and compact representations of the bidirectional bwt of a string s ∈ Σ* are described which provide increasing navigation power and a number of space-time tradeoffs, resulting in near-linear time algorithms for many sequence analysis problems for the first time in succinct space.
Abstract: We describe succinct and compact representations of the bidirectional bwt of a string s ∈ Σ* which provide increasing navigation power and a number of space-time tradeoffs. One such representation allows one to extend a substring of s by one character from the left and from the right in constant time, taking O(|s| log |Σ|) bits of space. We then match the functions supported by each representation to a number of algorithms that traverse the nodes of the suffix tree of s, exploiting connections between the bwt and the suffix-link tree. This results in near-linear time algorithms for many sequence analysis problems (e.g. maximal unique matches), for the first time in succinct space.

Posted Content
TL;DR: Order-preserving matching on numeric strings was introduced in this article, where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern.
Abstract: We introduce a new string matching problem called order-preserving matching on numeric strings where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price analysis and musical melody matching in which the order relations should be matched instead of the strings themselves. Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For the single pattern case, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For the multiple pattern case, we give an O(n log m) time algorithm.
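A brute-force version makes the order-preserving matching problem concrete. The paper's algorithms reach O(n log m) and O(n + m log m); this O(n·m²) check, assuming distinct values within each window, serves only as a specification:

```python
def order_isomorphic(x, y):
    """True if x and y have the same relative order (distinct values assumed)."""
    if len(x) != len(y):
        return False
    return all((x[i] < x[j]) == (y[i] < y[j])
               for i in range(len(x)) for j in range(i + 1, len(x)))

def op_match(text, pattern):
    """Naive order-preserving matching: positions of all order-isomorphic windows."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_isomorphic(text[i:i + m], pattern)]

# pattern [1, 3, 2] matches any window shaped low-high-middle,
# e.g. a stock price that rises then partially falls back
assert op_match([10, 20, 15, 5, 25, 17], [1, 3, 2]) == [0, 3]
```

The prefix and nearest-neighbor representations in the paper exist precisely to replace this all-pairs comparison of each window with a KMP-style scan.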

Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper proposes a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance, and develops a range-based method by grouping the pivotal entries to avoid duplicated computations.
Abstract: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
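The pruning idea builds on a classic observation: the minimum of a DP row never decreases, so once every entry in a row exceeds the current threshold the pair can be abandoned. The sketch below uses row-level early termination plus a shrinking top-k threshold; it is a simplification, not the paper's pivotal-entry framework:

```python
import heapq

def edit_distance_bounded(a, b, tau):
    """DP edit distance that abandons early once the distance must exceed tau."""
    if abs(len(a) - len(b)) > tau:
        return tau + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > tau:  # row minima are non-decreasing, so no path recovers
            return tau + 1
        prev = cur
    return prev[-1]

def topk_similar(query, strings, k):
    """Top-k strings by edit distance; the threshold tightens as results arrive."""
    heap = []  # max-heap of (-distance, string), size <= k
    for s in strings:
        tau = (-heap[0][0] - 1 if len(heap) == k
               else len(query) + max(map(len, strings)))
        d = edit_distance_bounded(query, s, tau)
        if d <= tau:
            heapq.heappush(heap, (-d, s))
            if len(heap) > k:
                heapq.heappop(heap)
    return sorted((-nd, s) for nd, s in heap)

assert topk_similar("hello", ["hallo", "hell", "help", "world"], 2) == \
    [(1, "hallo"), (1, "hell")]
```

Note how "world" is rejected by the cheap length filter alone once two distance-1 answers are in hand; the paper pushes this further by computing only pivotal DP entries.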

Proceedings ArticleDOI
04 Nov 2013
TL;DR: DIGLOSSIA accurately detects both SQL and NoSQL code injection attacks while avoiding the false positives and false negatives of prior methods, and recasts the problem of detecting injected code as a string propagation and parsing problem, gaining substantial improvements in efficiency and precision over prior work.
Abstract: Code injection attacks continue to plague applications that incorporate user input into executable programs. For example, SQL injection vulnerabilities rank fourth among all bugs reported in CVE, yet all previously proposed methods for detecting SQL injection attacks suffer from false positives and false negatives. This paper describes the design and implementation of DIGLOSSIA, a new tool that precisely and efficiently detects code injection attacks on server-side Web applications generating SQL and NoSQL queries. The main problems in detecting injected code are (1) recognizing code in the generated query, and (2) determining which parts of the query are tainted by user input. To recognize code, DIGLOSSIA relies on the precise definition due to Ray and Ligatti. To identify tainted characters, DIGLOSSIA dynamically maps all application-generated characters to shadow characters that do not occur in user input and computes shadow values for all input-dependent strings. Any original characters in a shadow value are thus exactly the taint from user input. Our key technical innovation is dual parsing. To detect injected code in a generated query, DIGLOSSIA parses the query in tandem with its shadow and checks that (1) the two parse trees are syntactically isomorphic, and (2) all code in the shadow query is in shadow characters and, therefore, originated from the application itself, as opposed to user input. We demonstrate that DIGLOSSIA accurately detects both SQL and NoSQL code injection attacks while avoiding the false positives and false negatives of prior methods. By recasting the problem of detecting injected code as a string propagation and parsing problem, we gain substantial improvements in efficiency and precision over prior work. Our approach does not require any changes to the databases, Web servers, or Web browsers, adds virtually unnoticeable performance overhead, and is deployable today.

Journal ArticleDOI
Ge Nong1
TL;DR: In this experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA), and is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA.
Abstract: This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is also known as constructing the suffix array (SA) for T. The theoretical memory usage of SACA-K is n log K + n log n + K log n bits. Moreover, we also have a practical implementation for SACA-K that uses n bytes + (n + 256) words and is suitable for strings over any alphabet up to full ASCII, where a word is log n bits. In our experiment, SACA-K outperforms SA-IS, which was previously the most time- and space-efficient linear-time SA construction algorithm (SACA). SACA-K is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA. Given K = O(1), SACA-K runs in linear time and O(1) workspace. To the best of our knowledge, such a result is the first reported in the literature with a practical source code publicly available.
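For contrast with the linear-time, constant-workspace SACA-K, the object being computed can be defined in one line. This naive version sorts the suffixes directly (O(n² log n) in the worst case) and is only meant to show what a suffix array is:

```python
def suffix_array(s):
    """Suffix array by direct suffix sorting.

    SACA-K computes the same array in O(n) time with O(1) extra
    workspace; this quadratic version exists purely for orientation.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])

# suffixes of "banana" in sorted order:
# a, ana, anana, banana, na, nana  ->  start positions 5, 3, 1, 0, 4, 2
assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```

The workspace distinction in the paper is exactly about avoiding auxiliary structures of this kind: the naive sort materializes every suffix comparison, while SACA-K works in place beside the input and output arrays.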

Journal ArticleDOI
TL;DR: The model of photovoltaic (PV) fields in mismatching conditions presented in this paper is a tradeoff between accuracy and calculation time; it profits from the possibility of expressing each PV module voltage as an explicit function of the current by using the Lambert W function.

Proceedings ArticleDOI
13 May 2013
TL;DR: This paper presents three different trie-based data structures to address the case where the string set is so large that compression is needed to fit the data structure in memory, and shows that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data.
Abstract: Virtually every modern search application, either desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
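The baseline structure the paper compresses is a score-annotated trie. An uncompressed sketch follows; real systems succinctly encode both the trie topology and the scores, which this plain-dict version makes no attempt to do:

```python
class TrieNode:
    __slots__ = ("children", "score")

    def __init__(self):
        self.children = {}
        self.score = None  # set when a complete string ends at this node

class CompletionTrie:
    """Uncompressed trie over (string, score) pairs for auto-completion."""

    def __init__(self, scored_strings):
        self.root = TrieNode()
        for s, score in scored_strings:
            node = self.root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.score = score

    def top_completions(self, prefix, k):
        """The k highest-scored strings beginning with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found = []
        stack = [(node, prefix)]
        while stack:  # exhaustive walk of the prefix subtree
            n, s = stack.pop()
            if n.score is not None:
                found.append((n.score, s))
            stack.extend((c, s + ch) for ch, c in n.children.items())
        return [s for _, s in sorted(found, reverse=True)[:k]]

trie = CompletionTrie([("car", 5), ("card", 9), ("care", 2), ("cat", 7)])
assert trie.top_completions("ca", 2) == ["card", "cat"]
```

The exhaustive subtree walk is the other thing production systems avoid: they store scores so that the top-k completions can be retrieved without visiting the whole subtree.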

Proceedings ArticleDOI
22 Jun 2013
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency.
Abstract: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees the optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the results from an empirical study of the algorithms verify the effectiveness and efficiency of our approach.

Patent
31 May 2013
TL;DR: In this article, the authors present a page display method and device which includes: in response to click operation on a browser by a user, updating, by the browser, a current page display rule according to parameters downloaded from a server corresponding to the browser and classifying and parsing the updated display rule; receiving a text from a target page, wherein the text contains a tag string used for displaying the target page.
Abstract: The present invention discloses a page display method and device. The method comprises: in response to a click operation on a browser by a user, updating, by the browser, a current page display rule according to parameters downloaded from a server corresponding to the browser, and classifying and parsing the updated display rule; receiving, by the browser, a text from a target page, wherein the text contains a tag string used for displaying the target page; when the browser parses a predetermined tag string in the tag strings, invoking, by the browser, the classified and parsed page display rule corresponding to the predetermined tag string to display the page. The technical solution according to the present invention accelerates the display speed of a target page, thus saving network traffic and improving the user experience.

Journal ArticleDOI
TL;DR: Hampi, a solver for string constraints over bounded string variables, is designed and implemented and used in static and dynamic analyses for finding SQL injection vulnerabilities in Web applications with hundreds of thousands of lines of code and in the context of automated bug finding in C programs using dynamic systematic testing.
Abstract: Many automatic testing, analysis, and verification techniques for programs can be effectively reduced to a constraint-generation phase followed by a constraint-solving phase. This separation of concerns often leads to more effective and maintainable software reliability tools. The increasing efficiency of off-the-shelf constraint solvers makes this approach even more compelling. However, there are few effective and sufficiently expressive off-the-shelf solvers for string constraints generated by analysis of string-manipulating programs, so researchers end up implementing their own ad-hoc solvers. To fulfill this need, we designed and implemented Hampi, a solver for string constraints over bounded string variables. Users of Hampi specify constraints using regular expressions, context-free grammars, equality between string terms, and typical string operations such as concatenation and substring extraction. Hampi then finds a string that satisfies all the constraints or reports that the constraints are unsatisfiable. We demonstrate Hampi's expressiveness and efficiency by applying it to program analysis and automated testing. We used Hampi in static and dynamic analyses for finding SQL injection vulnerabilities in Web applications with hundreds of thousands of lines of code. We also used Hampi in the context of automated bug finding in C programs using dynamic systematic testing (also known as concolic testing). We then compare Hampi with another string solver, CFGAnalyzer, and show that Hampi is several times faster. Hampi's source code, documentation, and experimental data are available at http://people.csail.mit.edu/akiezun/hampi1

Proceedings ArticleDOI
01 Dec 2013
TL;DR: An attributes-based approach to multi-writer word spotting that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare is proposed.
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to an unified representation of word images and strings, which seamlessly allows one to indistinctly perform query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme to correct the attributes scores based on Canonical Correlation Analysis that greatly improves the results on a challenging dataset. We test our approach on two public datasets showing state-of-the-art results.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A word spotting framework that follows the query-by-string paradigm where word images are represented both by textual and visual representations and this statistical representation can be used together with state-of-the-art indexation structures in order to deal with large-scale scenarios.
Abstract: In this paper, we present a word spotting framework that follows the query-by-string paradigm where word images are represented both by textual and visual representations. The textual representation is formulated in terms of character n-grams while the visual one is based on the bag-of-visual-words scheme. These two representations are merged together and projected to a sub-vector space. This transform allows one, given a textual query, to retrieve word instances that were only represented by the visual modality. Moreover, this statistical representation can be used together with state-of-the-art indexation structures in order to deal with large-scale scenarios. The proposed method is evaluated using a collection of historical documents, outperforming state-of-the-art performance.

Proceedings Article
01 Aug 2013
TL;DR: Travatar, a forest-to-string machine translation (MT) engine based on tree transducers, is described and found to achieve greater accuracy than phrase-based and hierarchical phrase-based translation.
Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than phrase-based and hierarchical phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar.

Journal ArticleDOI
TL;DR: This paper proposes a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly and proposes a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework.
Abstract: The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user-specified threshold. Recently, there has been considerable interest in designing new algorithms with the assistance of an inverted index to support efficient string similarity joins. These algorithms typically adopt a two-step filter-and-refine approach to identifying similar string pairs: 1) generating candidate pairs by traversing the inverted index; and 2) verifying the candidate pairs by computing the similarity. However, these algorithms either suffer from poor filtering power (which results in high verification cost), or incur too much computational cost to guarantee the filtering power. In this paper, we propose a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly. We also propose a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework. We conduct extensive experiments on both centralized and Hadoop systems using both real and synthetic data sets, and the results show that our proposed approach outperforms existing approaches in both efficiency and scalability.
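A sketch of the prefix filter at the core of such join algorithms, extended to multiple global token orderings as the abstract proposes: a pair survives only if its prefixes intersect under every ordering, so each extra ordering can only prune further. The token sets, Jaccard threshold, and orderings below are illustrative assumptions.

```python
import math

def prefix(tokens, order, threshold):
    """Sort tokens by a global order; for Jaccard threshold t, keep the
    first |x| - ceil(t * |x|) + 1 tokens (the standard prefix length)."""
    s = sorted(tokens, key=order.__getitem__)
    keep = len(s) - math.ceil(threshold * len(s)) + 1
    return set(s[:keep])

def survives_multi_prefix(r, s, orders, threshold):
    """Candidate pair only if the prefixes intersect under EVERY ordering."""
    return all(prefix(r, o, threshold) & prefix(s, o, threshold)
               for o in orders)

order1 = {t: i for i, t in enumerate("abcde")}   # one global ordering
order2 = {t: i for i, t in enumerate("edcba")}   # a different ordering
r, s = {"a", "b", "c", "d"}, {"a", "b", "c", "e"}  # Jaccard = 3/5 = 0.6
```

With threshold 0.8, the pair passes the single-ordering filter under `order1` but is pruned once `order2` is added, which is exactly the effect the paper exploits to cut verification cost.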

Book ChapterDOI
17 Jun 2013
TL;DR: In this article, the authors describe new linear-time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n.
Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also simple to implement, very fast in practice, and amenable to streaming implementation.
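For reference, a naive quadratic LZ factorization, to make concrete what the linear-time algorithms compute. This toy version does not attempt the paper's suffix-array machinery or its space bounds; it is only an illustration of the factorization itself.

```python
def lz_factorize(s):
    """Greedy LZ factorization: each factor is the longest match starting at
    an earlier position (possibly self-overlapping), or a single fresh
    character. Returns (source_position, length) or (char, 1) pairs."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(i):                       # candidate earlier start
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((s[i], 1))            # literal factor
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors
```

Note the self-referential case: in "aaaa", the second factor copies three characters starting from position 0, overlapping the region being produced, which is standard LZ77 behavior.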

Book
31 Jul 2013
TL;DR: It is proved that rewriting by a match-bounded system preserves regular languages; hence it is decidable whether a given rewriting system has a given match bound, and a criterion for the absence of a match bound is provided.
Abstract: We introduce a new class of automated proof methods for the termination of rewriting systems on strings. The basis of all these methods is to show that rewriting preserves regular languages. To this end, letters are annotated with natural numbers, called match heights. If the minimal height of all positions in a redex is h, then every position in the reduct will get height h+1. In a match-bounded system, match heights are globally bounded. Using recent results on deleting systems, we prove that rewriting by a match-bounded system preserves regular languages. Hence it is decidable whether a given rewriting system has a given match bound. We also provide a criterion for the absence of a match bound. It is still open whether match-boundedness is decidable. Match-boundedness for all strings can be used as an automated criterion for termination, since match-bounded systems are terminating. This criterion can be strengthened by requiring match-boundedness only for a restricted set of strings, namely the set of right hand sides of forward closures.
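The height-annotation rule above can be sketched as a single rewrite step on an annotated string: every letter of the reduct receives the minimal redex height plus one. The rule "ab -> ba" and the start string are illustrative assumptions; checking an actual match bound would require exploring all reductions, which this sketch does not do.

```python
def rewrite_step(annotated, lhs, rhs, pos):
    """Apply the string rule lhs -> rhs at position pos on a list of
    (letter, height) pairs, following the match-height annotation rule."""
    window = annotated[pos:pos + len(lhs)]
    assert "".join(c for c, _ in window) == lhs, "rule does not match here"
    h = min(height for _, height in window)      # minimal height in the redex
    reduct = [(c, h + 1) for c in rhs]           # every reduct letter gets h+1
    return annotated[:pos] + reduct + annotated[pos + len(lhs):]

s = [(c, 0) for c in "aab"]                      # start with all heights 0
s = rewrite_step(s, "ab", "ba", 1)               # a[ab] -> a[ba]
```

A system is match-bounded when no reachable annotation exceeds some fixed height, no matter how such steps are iterated.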

Journal ArticleDOI
TL;DR: A method to determine the photovoltaic (PV) series–parallel array configuration that provides the highest Global Maximum Power Point (GMPP) is proposed in this paper and is validated using simulations and experimental data.