
Showing papers on "String (computer science) published in 2010"


Journal ArticleDOI
TL;DR: Implementation of the CACC system, the string-stability characteristics of the practical setup, and experimental results are discussed, indicating the advantages of the design over standard adaptive-cruise-control functionality.
Abstract: The design of a cooperative adaptive cruise-control (CACC) system and its practical validation are presented. Focusing on the feasibility of implementation, a decentralized controller design with a limited communication structure is proposed (in this case, a wireless communication link with the nearest preceding vehicle only). A necessary and sufficient frequency-domain condition for string stability is derived, taking into account heterogeneous traffic, i.e., vehicles with possibly different characteristics. For a velocity-dependent intervehicle spacing policy, it is shown that the wireless communication link enables driving at small intervehicle distances while string stability is guaranteed. For a constant, velocity-independent intervehicle spacing, string stability cannot be guaranteed. To validate the theoretical results, experiments are performed with two CACC-equipped vehicles. Implementation of the CACC system, the string-stability characteristics of the practical setup, and experimental results are discussed, indicating the advantages of the design over standard adaptive-cruise-control functionality.
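
String stability is typically assessed in the frequency domain by requiring that disturbances are not amplified from one vehicle to the next along the platoon. As a hedged illustration only (our notation; the paper derives its own necessary and sufficient condition for heterogeneous traffic), a condition of the following form is commonly used, where v̂_i denotes the transform of the i-th vehicle's velocity:

```latex
% Illustrative string-stability condition (notation ours): the
% velocity-propagation transfer function from vehicle i-1 to vehicle i
% must not amplify any frequency component.
\[
  \bigl|\Gamma_i(j\omega)\bigr|
  \;=\;
  \left|\frac{\hat{v}_i(j\omega)}{\hat{v}_{i-1}(j\omega)}\right|
  \;\le\; 1
  \qquad \forall\,\omega \ge 0,\ \forall\, i .
\]
```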

779 citations


Proceedings ArticleDOI
16 May 2010
TL;DR: This paper builds an automatic end-to-end tool, Kudzu, applies it to the problem of finding client-side code injection vulnerabilities, and designs a new language of string constraints together with a solver for it.
Abstract: As AJAX applications gain popularity, client-side JavaScript code is becoming increasingly complex. However, few automated vulnerability analysis tools for JavaScript exist. In this paper, we describe the first system for exploring the execution space of JavaScript code using symbolic execution. To handle JavaScript code’s complex use of string operations, we design a new language of string constraints and implement a solver for it. We build an automatic end-to-end tool, Kudzu, and apply it to the problem of finding client-side code injection vulnerabilities. In experiments on 18 live web applications, Kudzu automatically discovers 2 previously unknown vulnerabilities and 9 more that were previously found only with a manually-constructed test suite.

474 citations


Book ChapterDOI
05 Dec 2010
TL;DR: This work constructs non-interactive zero-knowledge arguments for circuit satisfiability with perfect completeness, perfect zero-knowledge, and computational soundness; security is based on two new cryptographic assumptions.
Abstract: We construct non-interactive zero-knowledge arguments for circuit satisfiability with perfect completeness, perfect zero-knowledge and computational soundness. The non-interactive zero-knowledge arguments have sub-linear size and very efficient public verification. The size of the non-interactive zero-knowledge arguments can even be reduced to a constant number of group elements if we allow the common reference string to be large. Our constructions rely on groups with pairings and security is based on two new cryptographic assumptions; we do not use the Fiat-Shamir heuristic or random oracles.

457 citations


Book ChapterDOI
11 Oct 2010
TL;DR: This paper uses Lyndon words and introduces the Lyndon structure of runs as a useful tool when computing powers, and presents an efficient algorithm for testing primitivity of factors of a string and computing their primitive roots.
Abstract: A breakthrough in the field of text algorithms was the discovery of the fact that the maximal number of runs in a string of length n is O(n) and that they can all be computed in O(n) time. We study some applications of this result. New simpler O(n) time algorithms are presented for a few classical string problems: computing all distinct kth string powers for a given k, in particular squares for k = 2, and finding all local periods in a given string of length n. Additionally, we present an efficient algorithm for testing primitivity of factors of a string and computing their primitive roots. Applications of runs, despite their importance, are underrepresented in existing literature (approximately one page in the paper of Kolpakov & Kucherov, 1999). In this paper we attempt to fill in this gap. We use Lyndon words and introduce the Lyndon structure of runs as a useful tool when computing powers. In problems related to periods we use some versions of the Manhattan skyline problem.
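
For intuition, primitivity of a string can be tested with the classical doubling trick, and the primitive root can be found by checking divisors of the length. The sketch below is a naive illustration only (quadratic in the worst case, with names chosen by us), not the efficient runs-based algorithm described in the paper:

```python
def is_primitive(s: str) -> bool:
    # Classical "doubling" test: s is primitive iff its only occurrences
    # in s+s start at positions 0 and len(s).
    return (s + s).find(s, 1) == len(s)

def primitive_root(s: str) -> str:
    # Shortest string u with s == u repeated k times for some k >= 1,
    # found by trying every divisor of len(s). Illustration only.
    n = len(s)
    for d in range(1, n + 1):
        if n % d == 0 and s[:d] * (n // d) == s:
            return s[:d]
    return s

assert is_primitive("abab") is False and primitive_root("abab") == "ab"
```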

439 citations


Journal ArticleDOI
TL;DR: The algorithms presented here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
Abstract: Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms. Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed-length read sets, including capillary reads or long reads promised by the third-generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly. Contact: js18@sanger.ac.uk
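
As a point of reference, the brute-force way to find exact suffix-prefix overlaps of length at least τ is sketched below; it does quadratic (or worse) work over the read set, which is exactly the cost the FM-index-based approach in the paper avoids. The function and parameter names are ours, for illustration only:

```python
def overlaps(reads, tau):
    # Naive baseline: report (i, j, length) whenever a suffix of reads[i]
    # of length >= tau equals a prefix of reads[j]. Only the longest
    # overlap per ordered pair is kept. The paper achieves this with an
    # FM-index in O(N) overall; this version is for small inputs only.
    out = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            max_len = min(len(a), len(b))
            for L in range(max_len, tau - 1, -1):
                if a[-L:] == b[:L]:
                    out.append((i, j, L))
                    break
    return out

print(overlaps(["ACGTAC", "TACGGA", "GGATTT"], 3))
# [(0, 1, 3), (1, 2, 3)]
```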

267 citations


Journal Article
TL;DR: An upper bound of 0.5 n on the maximal number of cubic runs in a string of length n is shown in this article, together with an infinite family of binary words establishing a lower bound of 0.406 n.
Abstract: A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition v with a period p such that 2p≤|v|. The maximal number of runs in a string of length n has been thoroughly studied, and is known to be between 0.944 n and 1.029 n. In this paper we investigate cubic runs, in which the shortest period p satisfies 3p≤|v|. We show the upper bound of 0.5 n on the maximal number of such runs in a string of length n, and construct an infinite sequence of words over binary alphabet for which the lower bound is 0.406 n.
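
To make the definition concrete, the brute-force sketch below (ours, suitable only for very short strings) enumerates maximal repetitions whose shortest period p satisfies factor*p <= |v|; factor=2 gives ordinary runs and factor=3 gives the cubic runs studied in the paper:

```python
def smallest_period(w):
    # Smallest p such that w[k] == w[k-p] for all k >= p.
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k - p] for k in range(p, n)):
            return p
    return n

def cubic_runs(s, factor=3):
    # Brute-force enumeration of factors v = s[i:j] whose smallest period p
    # satisfies factor*p <= |v| and which are maximal, i.e. cannot be
    # extended left or right without breaking the period p.
    n, out = len(s), []
    for i in range(n):
        for j in range(i + 1, n + 1):
            v = s[i:j]
            p = smallest_period(v)
            if factor * p > len(v):
                continue
            left_ok = i == 0 or s[i - 1] != s[i - 1 + p]
            right_ok = j == n or s[j] != s[j - p]
            if left_ok and right_ok:
                out.append((i, j, p))
    return out

print(cubic_runs("aaaa"))   # [(0, 4, 1)] -- 'aaaa' has period 1 and 3*1 <= 4
```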

266 citations


Patent
27 Oct 2010
TL;DR: In this article, the pre-processing may include determining whether the plurality of values meets a randomness condition, a length condition, and/or a string ratio condition; the candidate data subset is inspected for computer instructions, characteristics of the computer instructions are determined, and a predetermined action is taken based on those characteristics.
Abstract: Detecting executable machine instructions in data is accomplished by accessing a plurality of values representing data contained within a memory of a computer system and performing pre-processing on the plurality of values to produce a candidate data subset. The pre-processing may include determining whether the plurality of values meets (a) a randomness condition, (b) a length condition, and/or (c) a string ratio condition. The candidate data subset is inspected for computer instructions, characteristics of the computer instructions are determined, and a predetermined action is taken based on the characteristics of the computer instructions.
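
The abstract names three pre-processing conditions (randomness, length, and string ratio) without giving formulas. One hedged interpretation, with entirely hypothetical thresholds and function names, is sketched below: a byte region is kept as a candidate for instruction inspection only if it is long enough, not essentially random (e.g., not compressed or encrypted data), and not dominated by printable text:

```python
import math

def shannon_entropy(data: bytes) -> float:
    # Bits of entropy per byte; near 8.0 suggests random-looking data.
    if not data:
        return 0.0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def printable_ratio(data: bytes) -> float:
    # Fraction of bytes in the printable ASCII range.
    return sum(32 <= b < 127 for b in data) / len(data) if data else 0.0

def candidate_for_disassembly(data: bytes,
                              min_len=64,
                              max_entropy=7.2,
                              max_print_ratio=0.9) -> bool:
    # Illustrative filters loosely modelled on the three conditions named
    # in the abstract; thresholds are arbitrary placeholders, not values
    # taken from the patent.
    return (len(data) >= min_len
            and shannon_entropy(data) <= max_entropy
            and printable_ratio(data) <= max_print_ratio)
```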

187 citations


Patent
Jae-Hun Jeong, Han-soo Kim, Wonseok Cho, Jae-Hoon Jang, Sunil Shim
02 Feb 2010
TL;DR: In this paper, the first selection line is connected to the at least one pair of first selection transistors of the NAND string, and a plurality of word lines are coupled to the plurality of memory cells.
Abstract: A non-volatile memory device having a vertical structure includes a NAND string having a vertical structure. The NAND string includes a plurality of memory cells, and at least one pair of first selection transistors arranged to be adjacent to a first end of the plurality of memory cells. A plurality of word lines are coupled to the plurality of memory cells of the NAND string. A first selection line is commonly connected to the at least one pair of first selection transistors of the NAND string.

182 citations


Book ChapterDOI
20 Mar 2010
TL;DR: Stranger can automatically prove that an application is free from specified attacks or generate vulnerability signatures that characterize all malicious inputs that can be used to generate attacks.
Abstract: Stranger is an automata-based string analysis tool for finding and eliminating string-related security vulnerabilities in PHP applications. Stranger uses symbolic forward and backward reachability analyses to compute the possible values that the string expressions can take during program execution. Stranger can automatically (1) prove that an application is free from specified attacks or (2) generate vulnerability signatures that characterize all malicious inputs that can be used to generate attacks.

152 citations


Posted Content
TL;DR: In this paper, it is shown that any boolean function can be evaluated optimally by a quantum query algorithm that alternates a fixed, input-independent reflection with a second reflection that coherently queries the input string.
Abstract: We show that any boolean function can be evaluated optimally by a quantum query algorithm that alternates a certain fixed, input-independent reflection with a second reflection that coherently queries the input string. Originally introduced for solving the unstructured search problem, this two-reflections structure is therefore a universal feature of quantum algorithms. Our proof goes via the general adversary bound, a semi-definite program (SDP) that lower-bounds the quantum query complexity of a function. By a quantum algorithm for evaluating span programs, this lower bound is known to be tight up to a sub-logarithmic factor. The extra factor comes from converting a continuous-time query algorithm into a discrete-query algorithm. We give a direct and simplified quantum algorithm based on the dual SDP, with a bounded-error query complexity that matches the general adversary bound. Therefore, the general adversary lower bound is tight; it is in fact an SDP for quantum query complexity. This implies that the quantum query complexity of the composition f(g,...,g) of two boolean functions f and g matches the product of the query complexities of f and g, without a logarithmic factor for error reduction. It further shows that span programs are equivalent to quantum query algorithms.
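
In symbols (our notation, paraphrasing the claims in the abstract rather than deriving them), with Adv± the general adversary bound and Q the bounded-error quantum query complexity of a boolean function:

```latex
% Tightness of the general adversary bound and the resulting
% composition property for boolean functions (paraphrased).
\[
  Q(h) \;=\; \Theta\!\bigl(\mathrm{Adv}^{\pm}(h)\bigr),
  \qquad
  \mathrm{Adv}^{\pm}\!\bigl(f \circ (g,\dots,g)\bigr)
  \;=\; \mathrm{Adv}^{\pm}(f)\,\mathrm{Adv}^{\pm}(g),
\]
\[
  \text{and hence}\qquad
  Q\bigl(f \circ (g,\dots,g)\bigr) \;=\; \Theta\!\bigl(Q(f)\,Q(g)\bigr).
\]
```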

148 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: This paper proposes a novel framework called trie-join, which uses a trie structure to index the strings and utilizes the trie structure to efficiently find similar string pairs based on subtrie pruning.
Abstract: A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) They are inefficient for the data sets with short strings (the average string length is no larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find the similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.
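
The trie-join algorithm itself joins two tries against each other; as a simpler, hedged illustration of the underlying subtrie-pruning idea, the sketch below indexes one string collection in a trie and answers a single edit-distance query, pruning an entire subtrie as soon as every cell of the dynamic-programming row exceeds the threshold tau (all names are ours, not the paper's):

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None          # set to the full string at leaf positions

def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root

def search(root, query, tau):
    # One edit-distance DP row per trie node; a whole subtrie is pruned
    # as soon as the minimum of the row exceeds tau.
    results = []
    first_row = list(range(len(query) + 1))

    def recurse(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for i, qc in enumerate(query, 1):
            row.append(min(row[i - 1] + 1,          # insertion
                           prev_row[i] + 1,         # deletion
                           prev_row[i - 1] + (qc != ch)))  # substitution/match
        if node.word is not None and row[-1] <= tau:
            results.append((node.word, row[-1]))
        if min(row) <= tau:                          # subtrie pruning
            for c, child in node.children.items():
                recurse(child, c, row)

    for c, child in root.children.items():
        recurse(child, c, first_row)
    return results

root = build_trie(["string", "strong", "strung", "spring"])
print(search(root, "strinh", 1))   # [('string', 1)]
```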

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work focuses on streaming transducers processing strings over finite alphabets, given the existence of a robust and well-studied class of "regular" transductions for this case, and finds that the expressiveness of streaming string transducers coincides exactly with this class of regular transductions.
Abstract: Streaming string transducers define (partial) functions from input strings to output strings. A streaming string transducer makes a single pass through the input string and uses a finite set of variables that range over strings from the output alphabet. At every step, the transducer processes an input symbol, and updates all the variables in parallel using assignments whose right-hand-sides are concatenations of output symbols and variables with the restriction that a variable can be used at most once in a right-hand-side expression. It has been shown that streaming string transducers operating on strings over infinite data domains are of interest in algorithmic verification of list-processing programs, as they lead to Pspace decision procedures for checking pre/postconditions and for checking semantic equivalence, for a well-defined class of heap-manipulating programs. In order to understand the theoretical expressiveness of streaming transducers, we focus on streaming transducers processing strings over finite alphabets, given the existence of a robust and well-studied class of "regular" transductions for this case. Such regular transductions can be defined either by two-way deterministic finite-state transducers, or using a logical MSO-based characterization. Our main result is that the expressiveness of streaming string transducers coincides exactly with this class of regular transductions.
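
A concrete toy instance of such a transducer, with two string variables and copyless parallel updates (each variable appears at most once across all right-hand sides of an update), is sketched below in Python. It realizes the regular transduction w -> w . reverse(w), which a two-way transducer can compute but a plain one-way finite-state transducer cannot; the function name is ours:

```python
def palindromize_sst(word: str) -> str:
    # Two string variables with copyless parallel updates:
    #   on reading symbol a:  X := X . a   (the prefix read so far)
    #                         Y := a . Y   (its reverse)
    # Output function at end of input: F = X . Y, i.e. w . reverse(w).
    X, Y = "", ""
    for a in word:
        X, Y = X + a, a + Y   # each variable occurs exactly once on the RHSs
    return X + Y

assert palindromize_sst("abc") == "abccba"
```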

Patent
21 May 2010
TL;DR: In this paper, a 3D memory device is described which includes bottom and top memory cubes having respective arrays of vertical NAND string structures, and a common source plane comprising a layer of conductive material is between the top and bottom memory cubes.
Abstract: A 3D memory device is described which includes bottom and top memory cubes having respective arrays of vertical NAND string structures. A common source plane comprising a layer of conductive material is between the top and bottom memory cubes. The source plane is supplied a bias voltage such as ground, and is selectively coupled to an end of the vertical NAND string structures of the bottom and top memory cubes. Memory cells in a particular memory cube are read using current through the particular vertical NAND string between the source plane and a corresponding bit line coupled to another end of the particular vertical NAND string.

Proceedings ArticleDOI
29 Jul 2010
TL;DR: The design of a CACC system and corresponding experiments are presented and it is shown that the available wireless information enables small inter-vehicle distances, while maintaining string stable behavior.
Abstract: The design of a CACC system and corresponding experiments are presented. The design targets string stable system behavior, which is assessed using a frequency-domain-based approach. Following this approach, it is shown that the available wireless information enables small inter-vehicle distances, while maintaining string stable behavior. The theoretical results are validated by experiments with two CACC-equipped vehicles. Measurement results showing string stable as well as string unstable behavior are discussed.

Proceedings ArticleDOI
28 Jan 2010
TL;DR: [Not recoverable from the extracted text.]
Abstract: [Diagram residue only; the extraction preserves just metamodel element names: ResourceDemandingAction, AcquireAction, ExternalCallAction, and ParametricResourceDemand with attributes demand : String and unit : String.]

Posted Content
TL;DR: These results lead to algorithms for assertion checking and for checking functional equivalence of two programs, written possibly in different programming styles, for commonly used routines such as insert, delete, and reverse.
Abstract: We introduce streaming data string transducers that map input data strings to output data strings in a single left-to-right pass in linear time. Data strings are (unbounded) sequences of data values, tagged with symbols from a finite set, over a potentially infinite data domain that supports only the operations of equality and ordering. The transducer uses a finite set of states, a finite set of variables ranging over the data domain, and a finite set of variables ranging over data strings. At every step, it can make decisions based on the next input symbol, updating its state, remembering the input data value in its data variables, and updating data-string variables by concatenating data-string variables and new symbols formed from data variables, while avoiding duplication. We establish that the problems of checking functional equivalence of two streaming transducers, and of checking whether a streaming transducer satisfies pre/post verification conditions specified by streaming acceptors over input/output data-strings, are in PSPACE. We identify a class of imperative and a class of functional programs, manipulating lists of data items, which can be effectively translated to streaming data-string transducers. The imperative programs dynamically modify a singly-linked heap by changing next-pointers of heap-nodes and by adding new nodes. The main restriction specifies how the next-pointers can be used for traversal. We also identify an expressively equivalent fragment of functional programs that traverse a list using syntactically restricted recursive calls. Our results lead to algorithms for assertion checking and for checking functional equivalence of two programs, written possibly in different programming styles, for commonly used routines such as insert, delete, and reverse.

Patent
31 Jul 2010
TL;DR: In this paper, an LED lamp includes a rectifier, an integrated circuit and a string of series-connected LEDs, and each power switch is coupled so that it can separately and selectably short out a corresponding one of several groups of LEDs in the string.
Abstract: An LED lamp includes a rectifier, an integrated circuit and a string of series-connected LEDs. The lamp receives an incoming AC signal such that a rectified version of the signal is present across the LED string. The integrated circuit includes a plurality of power switches. Each power switch is coupled so that it can separately and selectably short out a corresponding one of several groups of LEDs in the string. As the voltage across the string increases the integrated circuit controls the power switches such that the number of LEDs through which current flows increases, whereas as the voltage across the string decreases the integrated circuit controls the power switches such that the number of LEDs through which current flows decreases. LED string current flow is controlled and regulated to provide superior efficiency, reliability, anti-flicker, regulation against line voltage variations, power factor correction, and lamp over-voltage, over-current, and over-temperature protection.

Patent
18 Mar 2010
TL;DR: In this paper, a computer implemented natural language processing method is used to determine sub-components of the sentence string, assigning one or more unique tokens to each determined subcomponent, determining a probability of use that a determined sub-component has one or multiple specific meanings, based on the determined probability for use.
Abstract: A computer implemented natural language processing method, the method including the steps of: analysing a sentence string within textual information to determine sub-components of the sentence string, assigning one or more unique tokens to each determined sub-component, determining a probability of use that a determined sub-component has one or more specific meanings, based on the determined probability of use, creating a valid set of unique tokens that are associated with the sentence string, and linking verb sub-components associated with one or more of the unique tokens in the valid set of unique tokens to a pre-defined limited sub-set of verbs to create an identification tuple that maps onto the sub-set of verbs.

Book ChapterDOI
30 May 2010
TL;DR: In this article, the authors present a family of verifiable random functions which are provably secure for exponentially large input spaces under a noninteractive complexity assumption, without a common reference string.
Abstract: We present a family of verifiable random functions which are provably secure for exponentially-large input spaces under a noninteractive complexity assumption. Prior constructions required either an interactive complexity assumption or one that could tolerate a factor 2^n security loss for n-bit inputs. Our construction is practical and inspired by the pseudorandom functions of Naor and Reingold and the verifiable random functions of Lysyanskaya. Set in a bilinear group, where the Decisional Diffie-Hellman problem is easy to solve, we require the l-Decisional Diffie-Hellman Exponent assumption in the standard model, without a common reference string. Our core idea is to apply a simulation technique where the large space of VRF inputs is collapsed into a small (polynomial-size) input in the view of the reduction algorithm. This view, however, is information-theoretically hidden from the attacker. Since the input space is exponentially large, we can first apply a collision-resistant hash function to handle arbitrarily-large inputs.

Proceedings ArticleDOI
01 Mar 2010
TL;DR: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases based on the R-tree augmented with the min-wise signature and the linear hashing technique.
Abstract: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.
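
The min-wise signature idea can be illustrated independently of the R-tree: each string is reduced to its set of q-grams, each set is summarized by k minimum hash values, and the fraction of agreeing positions between two signatures estimates the set resemblance (Jaccard similarity) used for pruning. The sketch below is a generic MinHash illustration with hypothetical padding and hashing choices, not the MHR-tree construction itself:

```python
import hashlib

def qgrams(s, q=2):
    # Pad with '#' so prefixes/suffixes contribute q-grams (a common choice,
    # assumed here rather than taken from the paper).
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def minwise_signature(items, k=32):
    # k independent hash functions simulated by salting a single hash.
    sig = []
    for seed in range(k):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items))
    return sig

def estimated_resemblance(sig_a, sig_b):
    # The fraction of positions on which two signatures agree estimates the
    # Jaccard similarity of the underlying q-gram sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minwise_signature(qgrams("similarity"))
b = minwise_signature(qgrams("similarly"))
print(estimated_resemblance(a, b))   # rough Jaccard estimate of the q-gram sets
```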

Patent
Jae-Hoon Jang, Jung Dal Choi
29 Mar 2010
TL;DR: In this paper, a programming method of a semiconductor memory device includes charging a channel of an inhibit string to a precharge voltage provided to the common source line and boosting the charged channel by providing a wordline voltage to the cell strings.
Abstract: A programming method of a semiconductor memory device includes charging a channel of an inhibit string to a precharge voltage provided to the common source line and boosting the charged channel by providing a wordline voltage to the cell strings. The inhibit string is connected to a program bitline among the bitlines.

Proceedings Article
23 Aug 2010
TL;DR: This paper presents CPMerge, a simple and efficient algorithm for the τ-overlap join of inverted lists, designed for approximate dictionary matching under similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients.
Abstract: This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We first show that this task can be solved exactly by a τ-overlap join of inverted lists and propose an algorithm, called CPMerge, for computing this join. Given the inverted lists retrieved for a query, the algorithm collects fewer candidate strings and prunes unlikely candidates to efficiently find the strings that satisfy the constraint of the τ-overlap join. We conducted approximate dictionary matching experiments on three large-scale datasets that include person names, biomedical names, and general English words. The algorithm exhibited scalable performance on the datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and a threshold of 0.7).
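
The τ-overlap join itself can be stated very simply: report every dictionary string that shares at least τ features (e.g., q-grams) with the query. The naive merge-and-count version over inverted lists is sketched below; CPMerge improves on this by splitting the retrieved lists and pruning candidates early. The names and the q-gram featurization are ours, for illustration only:

```python
from collections import defaultdict

def build_inverted_index(strings, q=2):
    # Map each q-gram to the set of string ids containing it.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[g].add(sid)
    return index

def tau_overlap_join(query, index, tau, q=2):
    # Naive counting variant of the tau-overlap join: a string is reported
    # if it shares at least tau distinct q-grams with the query.
    counts = defaultdict(int)
    for g in {query[i:i + q] for i in range(len(query) - q + 1)}:
        for sid in index.get(g, ()):
            counts[sid] += 1
    return [sid for sid, c in counts.items() if c >= tau]

strings = ["johnson", "jonson", "johansson", "smith"]
idx = build_inverted_index(strings)
print(tau_overlap_join("jonhson", idx, tau=4))   # candidate ids sharing >= 4 bigrams
```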

Posted Content
TL;DR: Two representations of a string of length N compressed into a context-free grammar of size n are presented, achieving O(log N) random access time, together with several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
Abstract: Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot \alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
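
For contrast with the paper's O(log N) representation, the obvious baseline for random access in a grammar-compressed string stores the expansion length of every nonterminal and walks down the derivation, which costs time proportional to the grammar's height. A minimal sketch of this baseline follows, with our own encoding of the straight-line program as a Python dict (terminal rules map to a single character, binary rules to a pair of symbols):

```python
def slp_lengths(grammar):
    # Expansion length of every symbol, computed by memoized recursion.
    memo = {}
    def length(sym):
        if sym not in memo:
            rule = grammar[sym]
            memo[sym] = 1 if isinstance(rule, str) else length(rule[0]) + length(rule[1])
        return memo[sym]
    for sym in grammar:
        length(sym)
    return memo

def slp_access(grammar, lengths, start, i):
    # Naive random access to position i (0-based) of the generated string:
    # descend from the start symbol, choosing the left or right child by
    # comparing i with the left child's expansion length. O(height) per
    # query, versus the paper's O(log N) regardless of grammar shape.
    sym = start
    while not isinstance(grammar[sym], str):
        left, right = grammar[sym]
        if i < lengths[left]:
            sym = left
        else:
            sym, i = right, i - lengths[left]
    return grammar[sym]

# Example: S -> AB, A -> 'a', B -> AC, C -> 'b' generates "aab".
g = {"S": ("A", "B"), "A": "a", "B": ("A", "C"), "C": "b"}
L = slp_lengths(g)
assert "".join(slp_access(g, L, "S", k) for k in range(L["S"])) == "aab"
```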

Journal ArticleDOI
01 Jun 2010
TL;DR: A robust context integration model for on-line handwritten Japanese text recognition based on string class probability approximation that can flexibly combine the scores of various contexts and is insensitive to the variability in path length is described.
Abstract: This paper describes a robust context integration model for on-line handwritten Japanese text recognition. Based on string class probability approximation, the proposed method evaluates the likelihood of candidate segmentation–recognition paths by combining the scores of character recognition, unary and binary geometric features, as well as linguistic context. The path evaluation criterion can flexibly combine the scores of various contexts and is insensitive to the variability in path length, and so the optimal segmentation path with its string class can be effectively found by Viterbi search. Moreover, the model parameters are estimated by the genetic algorithm so as to optimize the holistic string recognition performance. In experiments on horizontal text lines extracted from the TUAT Kondate database, the proposed method achieves a segmentation rate of 0.9934, measured as an f-measure, and a character recognition rate of 92.80%.

Proceedings Article
02 Jun 2010
TL;DR: This work includes joint n-gram features inside a state-of-the-art discriminative sequence model for letter-to-phoneme and transliteration transduction, and the results indicate an improvement in overall performance.
Abstract: Phonetic string transduction problems, such as letter-to-phoneme conversion and name transliteration, have recently received much attention in the NLP community. In the past few years, two methods have come to dominate as solutions to supervised string transduction: generative joint n-gram models, and discriminative sequence models. Both approaches benefit from their ability to consider large, flexible spans of source context when making transduction decisions. However, they encode this context in different ways, providing their respective models with different information. To combine the strengths of these two systems, we include joint n-gram features inside a state-of-the-art discriminative sequence model. We evaluate our approach on several letter-to-phoneme and transliteration data sets. Our results indicate an improvement in overall performance with respect to both the joint n-gram approach and traditional feature sets for discriminative models.

Patent
23 Feb 2010
TL;DR: In this article, a system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a businessdocument thereon, to create a bitmap.
Abstract: A system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a business document thereon, to create a bitmap. A network server carries out a segmentation process to segment the scan-generated bitmap into a bitmap object, the bitmap object corresponding to the scanned business document; a bitmap-to-text conversion process to convert the bitmap object into a block of text; a semantic recognition process to generate a structured representation of semantic entities corresponding to the scanned business document; and a document generation process to convert the structured representation into a structured text file. The semantic recognition process includes the processes of generating, for each line of text having a keyword therein, a terminal symbol corresponding to the keyword therein; generating, for each line of text not having a keyword therein and absent of numeric characters, an alphabetic terminal symbol; generating, for each line of text not having a keyword therein and having a numeric character therein, an alphanumeric terminal symbol; generating a string of terminal symbols from the generated terminal symbols; determining a probable parsing of the generated string of terminal symbols; labeling each text line, according to a determined function, with non-terminal symbols; and parsing the business document information text into fields of business document information text based upon the non-terminal symbol of each text line and the determined probable parsing of the generated string of terminal symbols.
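
A hedged reading of the terminal-symbol generation rules described above (keyword lines get a keyword terminal, digit-free lines an alphabetic terminal, digit-bearing lines an alphanumeric terminal) might look like the sketch below; the keyword set and symbol names are hypothetical and not taken from the patent:

```python
KEYWORDS = {"invoice", "total", "date", "amount"}   # hypothetical keyword set

def terminal_symbol(line: str) -> str:
    # Label a line of text with a terminal symbol following the three rules
    # named in the abstract.
    words = {w.strip(".,:").lower() for w in line.split()}
    hit = words & KEYWORDS
    if hit:
        return "KW_" + sorted(hit)[0].upper()        # keyword terminal
    if any(ch.isdigit() for ch in line):
        return "ALNUM"                               # alphanumeric terminal
    return "ALPHA"                                   # alphabetic terminal

def terminal_string(text: str) -> list:
    # The string of terminal symbols that would be handed to the parser.
    return [terminal_symbol(line) for line in text.splitlines() if line.strip()]

print(terminal_string("Invoice 2010-07\nAcme Corporation\nTotal: 42.50"))
# ['KW_INVOICE', 'ALPHA', 'KW_TOTAL']
```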

Journal ArticleDOI
TL;DR: This paper shows that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches to ensemble clustering.
Abstract: In this paper, we study the ensemble clustering problem, where the input is in the form of multiple clustering solutions. The goal of ensemble clustering algorithms is to aggregate the solutions into one solution that maximizes the agreement in the input ensemble. We obtain several new results for this problem. Specifically, we show that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches. Our optimization proceeds by first constructing a non-linear objective function which is then transformed into a 0-1 Semidefinite program (SDP) using novel convexification techniques. This model can be subsequently relaxed to a polynomial time solvable SDP. In addition to the theoretical contributions, our experimental results on standard machine learning and synthetic datasets show that this approach leads to improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. In addition, we identify several new application scenarios for this problem. These include combining multiple image segmentations and generating tissue maps from multiple-channel Diffusion Tensor brain images to identify the underlying structure of the brain.

Patent
06 Aug 2010
TL;DR: In this article, a server system receives a visual query from a client system, which is an image containing text such as a picture of a document, and performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters.
Abstract: A server system receives a visual query from a client system. The visual query is an image containing text such as a picture of a document. At the receiving server or another server, optical character recognition (OCR) is performed on the visual query to produce text recognition data representing textual characters. Each character in a contiguous region of the visual query is individually scored according to its quality. The quality score of a respective character is influenced by the quality scores of neighboring or nearby characters. Using the scores, one or more high quality strings of characters are identified. Each high quality string has a plurality of high quality characters. A canonical document containing the one or more high quality textual strings is retrieved. At least a portion of the canonical document is sent to the client system.

Proceedings Article
11 Jul 2010
TL;DR: This paper presents a general q-gram-based framework and proposes two efficient algorithms, based on the strategies introduced, that show superior performance on the top-k similar string matching problem.
Abstract: Top-k approximate querying on string collections is an important data analysis tool for many applications, and it has been exhaustively studied. However, the scale of the problem has increased dramatically because of the prevalence of the Web. In this paper, we aim to explore the efficient top-k similar string matching problem. Several efficient strategies are introduced, such as length-aware and adaptive q-gram selection. We present a general q-gram-based framework and propose two efficient algorithms based on the strategies introduced. Our techniques are experimentally evaluated on three real data sets and show a superior performance.

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size which allows the translation to be inferred.
Abstract: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size (with respect to the minimal transducer) which allows the translation to be inferred. Until now, similar results had been known only for string transducers and for simple relabeling tree transducers. Learning of deterministic top-down tree transducers (dtops) is far more involved because a dtop can copy, delete, and permute its input subtrees. Thus, complex dependencies of labeled input to output paths need to be maintained by the algorithm. First, a Myhill-Nerode theorem is presented for dtops, which is interesting on its own. This theorem is then used to construct a learning algorithm for dtops. Finally, it is shown how our result can be applied to xml transformations (e.g. xslt programs). For this, a new dtd-based encoding of unranked trees by ranked ones is presented. Over such encodings, dtops can realize many practically interesting xml transformations which cannot be realized on first-child/next-sibling encodings.