
Showing papers on "String (computer science) published in 2013"


Proceedings ArticleDOI
18 Aug 2013
TL;DR: A general purpose string solver, called Z3-str, is developed as an extension of the Z3 SMT solver through its plug-in interface; it treats strings as a primitive type, thus avoiding the inherent limitations observed in many existing solvers that encode strings in terms of other primitives.
Abstract: Analyzing web applications requires reasoning about strings and non-strings cohesively. Existing string solvers either ignore non-string program behavior or support a limited set of string operations. In this paper, we develop a general purpose string solver, called Z3-str, as an extension of the Z3 SMT solver through its plug-in interface. Z3-str treats strings as a primitive type, thus avoiding the inherent limitations observed in many existing solvers that encode strings in terms of other primitives. The logic of the plug-in has three sorts, namely, bool, int and string. The string-sorted terms include string constants and variables of arbitrary length, with functions such as concatenation, sub-string, and replace. The int-sorted terms are standard, with the exception of the length function over string terms. The atomic formulas are equations over string terms, and (in)equalities over integer terms. Not only does our solver have features that enable whole-program symbolic, static and dynamic analysis, but it also performs better than other solvers in our experiments. The application of Z3-str in remote code execution detection shows that its support of a wide spectrum of string operations is key to reducing false positives.

205 citations
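To make the kind of question a string solver decides concrete, here is a deliberately naive, stdlib-only sketch: a real solver like Z3-str reasons symbolically over strings of arbitrary length, whereas this toy simply enumerates bounded candidates. The `solve_string_constraint` helper and the example constraint are illustrative inventions, not part of Z3-str.

```python
from itertools import product

def solve_string_constraint(predicate, alphabet="ab", max_len=3):
    """Brute-force search for a string satisfying `predicate`.

    A toy stand-in for what a string solver decides symbolically:
    here we simply enumerate all candidates up to `max_len`.
    """
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if predicate(s):
                return s
    return None  # unsatisfiable within the length bound

# Constraint: len(x) == 2 and the concatenation x + "b" contains "ab"
witness = solve_string_constraint(lambda x: len(x) == 2 and "ab" in x + "b")
assert witness is not None and len(witness) == 2 and "ab" in witness + "b"
```

The brute-force search degrades exponentially with length and alphabet size, which is precisely why dedicated decision procedures over a string sort, as in the paper, matter.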


Book ChapterDOI
21 Oct 2013
TL;DR: It is shown that if optimal string similarity metrics are chosen, those alone can produce alignments that are competitive with the state of the art in ontology alignment systems.
Abstract: Ontology alignment is an important part of enabling the semantic web to reach its full potential. The vast majority of ontology alignment systems use one or more string similarity metrics, but often the choice of which metrics to use is not given much attention. In this work we evaluate a wide range of such metrics, along with string pre-processing strategies such as removing stop words and considering synonyms, on different types of ontologies. We also present a set of guidelines on when to use which metric. We furthermore show that if optimal string similarity metrics are chosen, those alone can produce alignments that are competitive with the state of the art in ontology alignment systems. Finally, we examine the improvements possible to an existing ontology alignment system using an automated string metric selection strategy based upon the characteristics of the ontologies to be aligned.

183 citations
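Two of the most common string similarity metrics compared in studies like this are edit distance and q-gram overlap. A compact reference implementation (plain Python, not tied to any particular ontology alignment system) might look like:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def qgram_jaccard(a, b, q=2):
    """Jaccard similarity over the sets of q-grams of the two strings."""
    A = {a[i:i + q] for i in range(len(a) - q + 1)}
    B = {b[i:i + q] for i in range(len(b) - q + 1)}
    return len(A & B) / len(A | B) if A | B else 1.0

# "kitten" -> "sitting" needs 3 edits; "night"/"nacht" share one bigram of seven
assert levenshtein("kitten", "sitting") == 3
assert abs(qgram_jaccard("night", "nacht") - 1 / 7) < 1e-9
```

As the paper observes, which of these families performs best depends strongly on the ontologies being aligned, so the metric choice deserves explicit evaluation.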


Patent
Kevin A. Gibbs1
24 Sep 2013
TL;DR: In this paper, a set of ordered predicted completion strings including strings of ideographs are presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar).
Abstract: A set of ordered predicted completion strings including strings of ideographs are presented to a user as the user enters text in a text entry box (e.g., a browser or a toolbar). The user-entered text may include zero or more ideographs followed by one or more phonetic characters, or the entered text may be one or more ideographs. The predicted completion strings can be in the form of URLs or query strings. The ordering may be based on any number of factors (e.g., a query's frequency of submission from a community of users). URLs can be ranked based on an importance value of the URL. The sets of ordered predicted completion strings are obtained by matching a fingerprint value of the user's entry string to a fingerprint-to-table map which contains the set of ordered predicted completion strings. The generation of the ordered prediction strings takes into account multiple phonetic representations of certain strings of ideographs.

169 citations


Journal ArticleDOI
TL;DR: An integrated technique for simultaneous reaction path and exact transition state search is described by implementing an eigenvector following optimization algorithm in internal coordinates with Hessian update techniques.
Abstract: The growing string method (GSM) is highly useful for locating reaction paths connecting two molecular intermediates. GSM has often been used in a two-step procedure to locate exact transition states (TS), where GSM creates a quality initial structure for a local TS search. This procedure and others like it, however, do not always converge to the desired transition state because the local search is sensitive to the quality of the initial guess. This article describes an integrated technique for simultaneous reaction path and exact transition state search. This is achieved by implementing an eigenvector following optimization algorithm in internal coordinates with Hessian update techniques. After partial convergence of the string, an exact saddle point search begins under the constraint that the maximized eigenmode of the TS node Hessian has significant overlap with the string tangent near the TS. Subsequent optimization maintains connectivity of the string to the TS as well as locks in the TS direction, al...

153 citations


Book ChapterDOI
Charanjit S. Jutla1, Arnab Roy2
01 Dec 2013
TL;DR: A novel notion of quasi-adaptive non-interactive zero-knowledge (NIZK) proofs for probability distributions on parametrized languages is defined, and it is shown that the system can be extended to include integer tags in the defining equations, where the tags are provided adaptively by the adversary.
Abstract: We define a novel notion of quasi-adaptive non-interactive zero-knowledge (NIZK) proofs for probability distributions on parametrized languages. It is quasi-adaptive in the sense that the common reference string (CRS) generator can generate the CRS depending on the language parameters. However, the simulation is required to be uniform, i.e., a single efficient simulator should work for the whole class of parametrized languages. For distributions on languages that are linear subspaces of vector spaces over bilinear groups, we give quasi-adaptive computationally sound NIZKs that are shorter and more efficient than Groth-Sahai NIZKs. For many cryptographic applications quasi-adaptive NIZKs suffice, and our constructions can lead to significant improvements in the standard model. Our construction can be based on any k-linear assumption, and in particular under the eXternal Diffie-Hellman (XDH) assumption our proofs are even competitive with Random-Oracle-based Σ-protocol NIZK proofs. We also show that our system can be extended to include integer tags in the defining equations, where the tags are provided adaptively by the adversary. This leads to applicability of our system to many applications that use tags, e.g. applications using Cramer-Shoup projective hash proofs. Our techniques also lead to the shortest known ciphertext fully secure identity-based encryption (IBE) scheme under standard static assumptions (SXDH). Further, we also get a short publicly-verifiable CCA2-secure IBE scheme.

146 citations


Proceedings ArticleDOI
18 May 2013
TL;DR: A topic modeling analysis that combines question concepts, types, and code is presented to associate programming concepts and identifiers with particular types of questions, such as, “how to perform encoding”.
Abstract: Questions from Stack Overflow provide a unique opportunity to gain insight into what programming concepts are the most confusing. We present a topic modeling analysis that combines question concepts, types, and code. Using topic modeling, we are able to associate programming concepts and identifiers (like the String class) with particular types of questions, such as, “how to perform encoding”.

135 citations


Journal ArticleDOI
TL;DR: A detailed description of the generation of internal coordinates suitable for use in GSM as reactive tangents and in string optimization is given, and a climbing image scheme is included to improve the quality of the transition state approximation, ensuring high reliability of the method.
Abstract: The growing string method (GSM) has proven especially useful for locating chemical reaction paths at low computational cost. While many string methods use Cartesian coordinates, these methods can be substantially improved by changes in the coordinate system used for interpolation and optimization steps. The quality of the interpolation scheme is especially important because it determines how close the initial path is to the optimized reaction path, and this strongly affects the rate of convergence. In this article, a detailed description of the generation of internal coordinates (ICs) suitable for use in GSM as reactive tangents and in string optimization is given. Convergence of reaction paths is smooth because the IC tangent and orthogonal directions are better representations of chemical bonding compared to Cartesian coordinates. This is not only important quantitatively for reducing computational cost but also allows reaction paths to be described with smoothly varying chemically relevant coordinates. Benchmark computations with challenging reactions are compared to previous versions of GSM and show significant speedups. Finally, a climbing image scheme is included to improve the quality of the transition state approximation, ensuring high reliability of the method.

133 citations


Journal ArticleDOI
TL;DR: This work presents a methodology for computing the Burrows-Wheeler transform (BWT) of a string collection in a lightweight fashion, and gives two algorithms for recovering the strings in a collection from its BWT.

104 citations
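The transform itself is easy to state; the paper's contribution is computing it for large string collections in a lightweight fashion. For orientation only, here is a textbook single-string BWT with its inverse, assuming the sentinel character `$` does not occur in the input:

```python
def bwt(s):
    """Burrows-Wheeler transform via sorted rotations of s + sentinel."""
    s += "$"  # unique end-of-string sentinel, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(t):
    """Recover the original string by repeated sort-and-prepend."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(c + row for c, row in zip(t, table))
    return next(row for row in table if row.endswith("$"))[:-1]

assert bwt("banana") == "annb$aa"
assert inverse_bwt(bwt("banana")) == "banana"
```

This quadratic-space version is exactly what lightweight algorithms like the paper's avoid: they never materialize the rotation table, which is what makes BWT construction feasible for genome-scale collections.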


Journal ArticleDOI
TL;DR: The authors argue for audacious pedagogies of speculative fabulation, contending that the kinds of pedagogical endeavours that times of uncertainty call for are by no means straightforward and demand more venturesome approaches informed by speculative posthuman inquiries and exploratory new materialisms, as argued along with Elizabeth de Freitas (2020) writing in this issue.
Abstract: In resistance to capitalist logics of speculation, this article argues for audacious pedagogies of speculative fabulation. The kinds of pedagogical endeavours that times of uncertainty call for are by no means straightforward, calling, as I argue along with Elizabeth de Freitas (2020) writing in this issue, for more venturesome approaches informed by speculative posthuman inquiries and exploratory new materialisms. The Anthropocene and the Capitalocene are terms that capture the equivocal nature of the crisis-riven present. Laden with contradiction and destruction, these descriptors also embody strange afterlives. Beyond problematic present/futures produced by humans only for themselves lie intimate and uncanny sympoiesis, world-buildings and meaning-makings with non-human others and more-than-human processes. In accounting for these as well as for the already entangled material conditions of our time, pedagogy needs to pay attention to the slippery nature of cognition itself; a task to which the genre of science fiction, or speculative fabulation (SF), is primed.

94 citations


Journal ArticleDOI
TL;DR: A novel algorithm to detect text information from natural scene images that achieves the state-of-the-art performance on scene text classification and detection, and significantly outperforms the existing algorithms for character identification.

94 citations


Journal ArticleDOI
TL;DR: This letter describes a chain-of-states method that optimizes reaction paths under the sole constraint of equally spaced structures that requires no spring forces, interpolation algorithms, or other heuristics to control structure distribution.
Abstract: This letter describes a chain-of-states method that optimizes reaction paths under the sole constraint of equally spaced structures. In contrast to NEB and string methods, it requires no spring forces, interpolation algorithms, or other heuristics to control structure distribution. Rigorous use of a quadratic PES allows calculation of an optimization step with a predefined distribution in Cartesian space. The method is a formal extension of single-structure quasi-Newton methods. An initial guess can be evolved, as in the growing string method.

Proceedings ArticleDOI
18 Mar 2013
TL;DR: An approach in which a natural language model is incorporated into a search-based input data generation process with the aim of improving the human readability of generated strings is presented.
Abstract: The frequent non-availability of an automated oracle means that, in practice, checking software behaviour is frequently a painstakingly manual task. Despite the high cost of human oracle involvement, there has been little research investigating how to make the role easier and less time-consuming. One source of human oracle cost is the inherent unreadability of machine-generated test inputs. In particular, automatically generated string inputs tend to be arbitrary sequences of characters that are awkward to read. This makes test cases hard to comprehend and time-consuming to check. In this paper we present an approach in which a natural language model is incorporated into a search-based input data generation process with the aim of improving the human readability of generated strings. We further present a human study of test inputs generated using the technique on 17 open source Java case studies. For 10 of the case studies, the participants recorded significantly faster times when evaluating inputs produced using the language model, with medium to large effect sizes 60% of the time. In addition, the study found that accuracy of test input evaluation was also significantly improved for 3 of the case studies.

Journal ArticleDOI
TL;DR: A forward-backward lattice pruning algorithm is proposed to reduce the computation in training when trigram language models are used, and beam search techniques are investigated to accelerate the decoding speed.
Abstract: This paper proposes a method for handwritten Chinese/Japanese text (character string) recognition based on semi-Markov conditional random fields (semi-CRFs). The high-order semi-CRF model is defined on a lattice containing all possible segmentation-recognition hypotheses of a string to elegantly fuse the scores of candidate character recognition and the compatibilities of geometric and linguistic contexts by representing them in the feature functions. Based on given models of character recognition and compatibilities, the fusion parameters are optimized by minimizing the negative log-likelihood loss with a margin term on a training string sample set. A forward-backward lattice pruning algorithm is proposed to reduce the computation in training when trigram language models are used, and beam search techniques are investigated to accelerate the decoding speed. We evaluate the performance of the proposed method on unconstrained online handwritten text lines of three databases. On the test sets of databases CASIA-OLHWDB (Chinese) and TUAT Kondate (Japanese), the character level correct rates are 95.20 and 95.44 percent, and the accurate rates are 94.54 and 94.55 percent, respectively. On the test set (online handwritten texts) of ICDAR 2011 Chinese handwriting recognition competition, the proposed method outperforms the best system in competition.

Book ChapterDOI
02 Sep 2013
TL;DR: Succinct and compact representations of the bidirectional bwt of a string s ∈ Σ* are described which provide increasing navigation power and a number of space-time tradeoffs, resulting in near-linear time algorithms for many sequence analysis problems for the first time in succinct space.
Abstract: We describe succinct and compact representations of the bidirectional bwt of a string s ∈ Σ* which provide increasing navigation power and a number of space-time tradeoffs. One such representation allows one to extend a substring of s by one character from the left and from the right in constant time, taking O(|s| log |Σ|) bits of space. We then match the functions supported by each representation to a number of algorithms that traverse the nodes of the suffix tree of s, exploiting connections between the bwt and the suffix-link tree. This results in near-linear time algorithms for many sequence analysis problems (e.g. maximal unique matches), for the first time in succinct space.

Posted Content
TL;DR: Order-preserving matching on numeric strings was introduced in this article, where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern.
Abstract: We introduce a new string matching problem called order-preserving matching on numeric strings where a pattern matches a text if the text contains a substring whose relative orders coincide with those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price analysis and musical melody matching in which the order relations should be matched instead of the strings themselves. Solving order-preserving matching has to do with representations of order relations of a numeric string. We define prefix representation and nearest neighbor representation, which lead to efficient algorithms for order-preserving matching. We present efficient algorithms for single and multiple pattern cases. For the single pattern case, we give an O(n log m) time algorithm and optimize it further to obtain O(n + m log m) time. For the multiple pattern case, we give an O(n log m) time algorithm.
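A brute-force version makes the order-preserving matching problem concrete. The paper's algorithms reach O(n log m) and O(n + m log m); this O(n·m²) check, assuming distinct values within each window, serves only as a specification:

```python
def order_isomorphic(x, y):
    """True if x and y have the same relative order (distinct values assumed)."""
    if len(x) != len(y):
        return False
    return all((x[i] < x[j]) == (y[i] < y[j])
               for i in range(len(x)) for j in range(i + 1, len(x)))

def op_match(text, pattern):
    """Naive order-preserving matching: positions of all order-isomorphic windows."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if order_isomorphic(text[i:i + m], pattern)]

# pattern [1, 3, 2] matches any window shaped low-high-middle,
# e.g. a stock price that rises then partially falls back
assert op_match([10, 20, 15, 5, 25, 17], [1, 3, 2]) == [0, 3]
```

The prefix and nearest-neighbor representations in the paper exist precisely to replace this all-pairs comparison of each window with a KMP-style scan.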

Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper proposes a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance, and develops a range-based method by grouping the pivotal entries to avoid duplicated computations.
Abstract: String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints, which, given a collection of strings and a query string, returns the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find top-k answers. However it is rather expensive to select an appropriate threshold. To address this problem, we propose a progressive framework by improving the traditional dynamic-programming algorithm to compute edit distance. We prune unnecessary entries in the dynamic-programming matrix and only compute those pivotal entries. We extend our techniques to support top-k similarity search. We develop a range-based method by grouping the pivotal entries to avoid duplicated computations. Experimental results show that our method achieves high performance, and significantly outperforms state-of-the-art approaches on real-world datasets.
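The pruning idea builds on a classic observation: the minimum of a DP row never decreases, so once every entry in a row exceeds the current threshold the pair can be abandoned. The sketch below uses row-level early termination plus a shrinking top-k threshold; it is a simplification, not the paper's pivotal-entry framework:

```python
import heapq

def edit_distance_bounded(a, b, tau):
    """DP edit distance that abandons early once the distance must exceed tau."""
    if abs(len(a) - len(b)) > tau:
        return tau + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > tau:  # row minima are non-decreasing, so no path recovers
            return tau + 1
        prev = cur
    return prev[-1]

def topk_similar(query, strings, k):
    """Top-k strings by edit distance; the threshold tightens as results arrive."""
    heap = []  # max-heap of (-distance, string), size <= k
    for s in strings:
        tau = (-heap[0][0] - 1 if len(heap) == k
               else len(query) + max(map(len, strings)))
        d = edit_distance_bounded(query, s, tau)
        if d <= tau:
            heapq.heappush(heap, (-d, s))
            if len(heap) > k:
                heapq.heappop(heap)
    return sorted((-nd, s) for nd, s in heap)

assert topk_similar("hello", ["hallo", "hell", "help", "world"], 2) == \
    [(1, "hallo"), (1, "hell")]
```

Note how "world" is rejected by the cheap length filter alone once two distance-1 answers are in hand; the paper pushes this further by computing only pivotal DP entries.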

Proceedings ArticleDOI
04 Nov 2013
TL;DR: DIGLOSSIA accurately detects both SQL and NoSQL code injection attacks while avoiding the false positives and false negatives of prior methods, and recasts the problem of detecting injected code as a string propagation and parsing problem, gaining substantial improvements in efficiency and precision over prior work.
Abstract: Code injection attacks continue to plague applications that incorporate user input into executable programs. For example, SQL injection vulnerabilities rank fourth among all bugs reported in CVE, yet all previously proposed methods for detecting SQL injection attacks suffer from false positives and false negatives. This paper describes the design and implementation of DIGLOSSIA, a new tool that precisely and efficiently detects code injection attacks on server-side Web applications generating SQL and NoSQL queries. The main problems in detecting injected code are (1) recognizing code in the generated query, and (2) determining which parts of the query are tainted by user input. To recognize code, DIGLOSSIA relies on the precise definition due to Ray and Ligatti. To identify tainted characters, DIGLOSSIA dynamically maps all application-generated characters to shadow characters that do not occur in user input and computes shadow values for all input-dependent strings. Any original characters in a shadow value are thus exactly the taint from user input. Our key technical innovation is dual parsing. To detect injected code in a generated query, DIGLOSSIA parses the query in tandem with its shadow and checks that (1) the two parse trees are syntactically isomorphic, and (2) all code in the shadow query is in shadow characters and, therefore, originated from the application itself, as opposed to user input. We demonstrate that DIGLOSSIA accurately detects both SQL and NoSQL code injection attacks while avoiding the false positives and false negatives of prior methods. By recasting the problem of detecting injected code as a string propagation and parsing problem, we gain substantial improvements in efficiency and precision over prior work. Our approach does not require any changes to the databases, Web servers, or Web browsers, adds virtually unnoticeable performance overhead, and is deployable today.

Journal ArticleDOI
Ge Nong1
TL;DR: In this experiment, SACA-K outperforms SA-IS that was previously the most time- and space-efficient linear-time SA construction algorithm (SACA), and is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA.
Abstract: This article presents an O(n)-time algorithm called SACA-K for sorting the suffixes of an input string T[0, n-1] over an alphabet A[0, K-1]. The problem of sorting the suffixes of T is also known as constructing the suffix array (SA) for T. The theoretical memory usage of SACA-K is n log K + n log n + K log n bits. Moreover, we also have a practical implementation for SACA-K that uses n bytes + (n + 256) words and is suitable for strings over any alphabet up to full ASCII, where a word is log n bits. In our experiment, SACA-K outperforms SA-IS, which was previously the most time- and space-efficient linear-time SA construction algorithm (SACA). SACA-K is around 33% faster and uses a smaller deterministic workspace of K words, where the workspace is the space needed beyond the input string and the output SA. Given K = O(1), SACA-K runs in linear time and O(1) workspace. To the best of our knowledge, such a result is the first reported in the literature with a practical source code publicly available.
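For contrast with the linear-time, constant-workspace SACA-K, the object being computed can be defined in one line. This naive version sorts the suffixes directly (O(n² log n) in the worst case) and is only meant to show what a suffix array is:

```python
def suffix_array(s):
    """Suffix array by direct suffix sorting.

    SACA-K computes the same array in O(n) time with O(1) extra
    workspace; this quadratic version exists purely for orientation.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])

# suffixes of "banana" in sorted order:
# a, ana, anana, banana, na, nana  ->  start positions 5, 3, 1, 0, 4, 2
assert suffix_array("banana") == [5, 3, 1, 0, 4, 2]
```

The workspace distinction in the paper is exactly about avoiding auxiliary structures of this kind: the naive sort materializes every suffix comparison, while SACA-K works in place beside the input and output arrays.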

Journal ArticleDOI
TL;DR: The model of photovoltaic (PV) fields in mismatching conditions presented in this paper is a tradeoff between accuracy and calculation time; it profits from the possibility of expressing each PV module voltage as an explicit function of the current by using the Lambert W function.

Proceedings ArticleDOI
13 May 2013
TL;DR: This paper presents three different trie-based data structures to address the case where the string set is so large that compression is needed to fit the data structure in memory, and shows that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data.
Abstract: Virtually every modern search application, either desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.
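The baseline structure the paper compresses is a score-annotated trie. An uncompressed sketch follows; real systems succinctly encode both the trie topology and the scores, which this plain-dict version makes no attempt to do:

```python
class TrieNode:
    __slots__ = ("children", "score")

    def __init__(self):
        self.children = {}
        self.score = None  # set when a complete string ends at this node

class CompletionTrie:
    """Uncompressed trie over (string, score) pairs for auto-completion."""

    def __init__(self, scored_strings):
        self.root = TrieNode()
        for s, score in scored_strings:
            node = self.root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.score = score

    def top_completions(self, prefix, k):
        """The k highest-scored strings beginning with `prefix`."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        found = []
        stack = [(node, prefix)]
        while stack:  # exhaustive walk of the prefix subtree
            n, s = stack.pop()
            if n.score is not None:
                found.append((n.score, s))
            stack.extend((c, s + ch) for ch, c in n.children.items())
        return [s for _, s in sorted(found, reverse=True)[:k]]

trie = CompletionTrie([("car", 5), ("card", 9), ("care", 2), ("cat", 7)])
assert trie.top_completions("ca", 2) == ["card", "cat"]
```

The exhaustive subtree walk is the other thing production systems avoid: they store scores so that the top-k completions can be retrieved without visiting the whole subtree.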

Proceedings ArticleDOI
22 Jun 2013
TL;DR: An expansion-based framework to measure string similarities efficiently while considering synonyms is presented, and an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency.
Abstract: A string similarity measure quantifies the similarity between two text strings for approximate string matching or comparison. For example, the strings "Sam" and "Samuel" can be considered similar. Most existing work that computes the similarity of two strings only considers syntactic similarities, e.g., number of common words or q-grams. While these are indeed indicators of similarity, there are many important cases where syntactically different strings can represent the same real-world object. For example, "Bill" is a short form of "William". Given a collection of predefined synonyms, the purpose of the paper is to explore such existing knowledge to evaluate string similarity measures more effectively and efficiently, thereby boosting the quality of string matching. In particular, we first present an expansion-based framework to measure string similarities efficiently while considering synonyms. Because using synonyms in similarity measures is, while expressive, computationally expensive (NP-hard), we propose an efficient algorithm, called selective-expansion, which guarantees the optimality in many real scenarios. We then study a novel indexing structure called SI-tree, which combines both signature and length filtering strategies, for efficient string similarity joins with synonyms. We develop an estimator to approximate the size of candidates to enable an online selection of signature filters to further improve the efficiency. This estimator provides strong low-error, high-confidence guarantees while requiring only logarithmic space and time costs, thus making our method attractive both in theory and in practice. Finally, the results from an empirical study of the algorithms verify the effectiveness and efficiency of our approach.

Patent
31 May 2013
TL;DR: In this article, the authors present a page display method and device which includes: in response to click operation on a browser by a user, updating, by the browser, a current page display rule according to parameters downloaded from a server corresponding to the browser and classifying and parsing the updated display rule; receiving a text from a target page, wherein the text contains a tag string used for displaying the target page.
Abstract: The present invention discloses a page display method and device. The method comprises: in response to a click operation on a browser by a user, updating, by the browser, a current page display rule according to parameters downloaded from a server corresponding to the browser, and classifying and parsing the updated display rule; receiving, by the browser, a text from a target page, wherein the text contains a tag string used for displaying the target page; when the browser parses a predetermined tag string in the tag strings, invoking, by the browser, the classified and parsed page display rule corresponding to the predetermined tag string to display the page. The technical solution according to the present invention accelerates the display speed of a target page, thus saving network traffic and improving the user experience.

Journal ArticleDOI
TL;DR: Hampi, a solver for string constraints over bounded string variables, is designed and implemented and used in static and dynamic analyses for finding SQL injection vulnerabilities in Web applications with hundreds of thousands of lines of code and in the context of automated bug finding in C programs using dynamic systematic testing.
Abstract: Many automatic testing, analysis, and verification techniques for programs can be effectively reduced to a constraint-generation phase followed by a constraint-solving phase. This separation of concerns often leads to more effective and maintainable software reliability tools. The increasing efficiency of off-the-shelf constraint solvers makes this approach even more compelling. However, there are few effective and sufficiently expressive off-the-shelf solvers for string constraints generated by analysis of string-manipulating programs, so researchers end up implementing their own ad-hoc solvers. To fulfill this need, we designed and implemented Hampi, a solver for string constraints over bounded string variables. Users of Hampi specify constraints using regular expressions, context-free grammars, equality between string terms, and typical string operations such as concatenation and substring extraction. Hampi then finds a string that satisfies all the constraints or reports that the constraints are unsatisfiable. We demonstrate Hampi's expressiveness and efficiency by applying it to program analysis and automated testing. We used Hampi in static and dynamic analyses for finding SQL injection vulnerabilities in Web applications with hundreds of thousands of lines of code. We also used Hampi in the context of automated bug finding in C programs using dynamic systematic testing (also known as concolic testing). We then compare Hampi with another string solver, CFGAnalyzer, and show that Hampi is several times faster. Hampi's source code, documentation, and experimental data are available at http://people.csail.mit.edu/akiezun/hampi1

Proceedings ArticleDOI
01 Dec 2013
TL;DR: An attributes-based approach to multi-writer word spotting that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare is proposed.
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to an unified representation of word images and strings, which seamlessly allows one to indistinctly perform query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme to correct the attributes scores based on Canonical Correlation Analysis that greatly improves the results on a challenging dataset. We test our approach on two public datasets showing state-of-the-art results.

Proceedings ArticleDOI
25 Aug 2013
TL;DR: A word spotting framework that follows the query-by-string paradigm where word images are represented both by textual and visual representations and this statistical representation can be used together with state-of-the-art indexation structures in order to deal with large-scale scenarios.
Abstract: In this paper, we present a word spotting framework that follows the query-by-string paradigm where word images are represented both by textual and visual representations. The textual representation is formulated in terms of character n-grams while the visual one is based on the bag-of-visual-words scheme. These two representations are merged together and projected to a sub-vector space. This transform allows one, given a textual query, to retrieve word instances that were only represented by the visual modality. Moreover, this statistical representation can be used together with state-of-the-art indexation structures in order to deal with large-scale scenarios. The proposed method is evaluated using a collection of historical documents, outperforming state-of-the-art performance.

Proceedings Article
01 Aug 2013
TL;DR: Travatar, a forest-to-string machine translation (MT) engine based on tree transducers, is described and found to achieve greater accuracy than phrase-based and hierarchical phrase-based translation.
Abstract: In this paper we describe Travatar, a forest-to-string machine translation (MT) engine based on tree transducers. It provides an open-source C++ implementation for the entire forest-to-string MT pipeline, including rule extraction, tuning, decoding, and evaluation. There are a number of options for model training, and tuning includes advanced options such as hypergraph MERT and training of sparse features through online learning. The training pipeline is modeled after that of the popular Moses decoder, so users familiar with Moses should be able to get started quickly. We perform a validation experiment of the decoder on English-Japanese machine translation, and find that it is possible to achieve greater accuracy than phrase-based and hierarchical phrase-based translation. As auxiliary results, we also compare different syntactic parsers and alignment techniques that we tested in the process of developing the decoder. Travatar is available under the LGPL at http://phontron.com/travatar.

Journal ArticleDOI
TL;DR: This paper proposes a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly and proposes a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework.
Abstract: The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user-specified threshold. Recently, there has been considerable interest in designing new algorithms with the assistance of an inverted index to support efficient string similarity joins. These algorithms typically adopt a two-step filter-and-refine approach to identifying similar string pairs: 1) generating candidate pairs by traversing the inverted index; and 2) verifying the candidate pairs by computing the similarity. However, these algorithms either suffer from poor filtering power (which results in high verification cost), or incur too much computational cost to guarantee the filtering power. In this paper, we propose a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly. We also propose a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework. We conduct extensive experiments on both centralized and Hadoop systems using both real and synthetic data sets, and the results show that our proposed approach outperforms existing approaches in both efficiency and scalability.
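A sketch of the prefix filter at the core of such join algorithms, extended to multiple global token orderings as the abstract proposes: a pair survives only if its prefixes intersect under every ordering, so each extra ordering can only prune further. The token sets, Jaccard threshold, and orderings below are illustrative assumptions.

```python
import math

def prefix(tokens, order, threshold):
    """Sort tokens by a global order; for Jaccard threshold t, keep the
    first |x| - ceil(t * |x|) + 1 tokens (the standard prefix length)."""
    s = sorted(tokens, key=order.__getitem__)
    keep = len(s) - math.ceil(threshold * len(s)) + 1
    return set(s[:keep])

def survives_multi_prefix(r, s, orders, threshold):
    """Candidate pair only if the prefixes intersect under EVERY ordering."""
    return all(prefix(r, o, threshold) & prefix(s, o, threshold)
               for o in orders)

order1 = {t: i for i, t in enumerate("abcde")}   # one global ordering
order2 = {t: i for i, t in enumerate("edcba")}   # a different ordering
r, s = {"a", "b", "c", "d"}, {"a", "b", "c", "e"}  # Jaccard = 3/5 = 0.6
```

With threshold 0.8, the pair passes the single-ordering filter under `order1` but is pruned once `order2` is added, which is exactly the effect the paper exploits to cut verification cost.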

Book ChapterDOI
17 Jun 2013
TL;DR: In this article, the authors describe new linear-time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n.
Abstract: Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck in many diverse applications, including data compression, text indexing, and pattern discovery. We describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n) bits of working space to factorize a string of length n. These are the most space efficient linear time algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms are also simple to implement, very fast in practice, and amenable to streaming implementation.
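For reference, a naive quadratic LZ factorization, to make concrete what the linear-time algorithms compute. This toy version does not attempt the paper's suffix-array machinery or its space bounds; it is only an illustration of the factorization itself.

```python
def lz_factorize(s):
    """Greedy LZ factorization: each factor is the longest match starting at
    an earlier position (possibly self-overlapping), or a single fresh
    character. Returns (source_position, length) or (char, 1) pairs."""
    factors, i = [], 0
    while i < len(s):
        best_len, best_pos = 0, -1
        for j in range(i):                       # candidate earlier start
            k = 0
            while i + k < len(s) and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((s[i], 1))            # literal factor
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors
```

Note the self-referential case: in "aaaa", the second factor copies three characters starting from position 0, overlapping the region being produced, which is standard LZ77 behavior.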

Book
31 Jul 2013
TL;DR: It is proved that rewriting by a match-bounded system preserves regular languages; hence it is decidable whether a given rewriting system has a given match bound, and a criterion for the absence of a match bound is provided.
Abstract: We introduce a new class of automated proof methods for the termination of rewriting systems on strings. The basis of all these methods is to show that rewriting preserves regular languages. To this end, letters are annotated with natural numbers, called match heights. If the minimal height of all positions in a redex is h, then every position in the reduct will get height h+1. In a match-bounded system, match heights are globally bounded. Using recent results on deleting systems, we prove that rewriting by a match-bounded system preserves regular languages. Hence it is decidable whether a given rewriting system has a given match bound. We also provide a criterion for the absence of a match bound. It is still open whether match-boundedness is decidable. Match-boundedness for all strings can be used as an automated criterion for termination, since match-bounded systems are terminating. This criterion can be strengthened by requiring match-boundedness only for a restricted set of strings, namely the set of right hand sides of forward closures.
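The height-annotation rule above can be sketched as a single rewrite step on an annotated string: every letter of the reduct receives the minimal redex height plus one. The rule "ab -> ba" and the start string are illustrative assumptions; checking an actual match bound would require exploring all reductions, which this sketch does not do.

```python
def rewrite_step(annotated, lhs, rhs, pos):
    """Apply the string rule lhs -> rhs at position pos on a list of
    (letter, height) pairs, following the match-height annotation rule."""
    window = annotated[pos:pos + len(lhs)]
    assert "".join(c for c, _ in window) == lhs, "rule does not match here"
    h = min(height for _, height in window)      # minimal height in the redex
    reduct = [(c, h + 1) for c in rhs]           # every reduct letter gets h+1
    return annotated[:pos] + reduct + annotated[pos + len(lhs):]

s = [(c, 0) for c in "aab"]                      # start with all heights 0
s = rewrite_step(s, "ab", "ba", 1)               # a[ab] -> a[ba]
```

A system is match-bounded when no reachable annotation exceeds some fixed height, no matter how such steps are iterated.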

Journal ArticleDOI
TL;DR: A method to determine the photovoltaic (PV) series–parallel array configuration that provides the highest Global Maximum Power Point (GMPP) is proposed in this paper and is validated using simulations and experimental data.