
Showing papers on "String (computer science) published in 2010"


Journal ArticleDOI
TL;DR: Implementation of the CACC system, the string-stability characteristics of the practical setup, and experimental results are discussed, indicating the advantages of the design over standard adaptive-cruise-control functionality.
Abstract: The design of a cooperative adaptive cruise-control (CACC) system and its practical validation are presented. Focusing on the feasibility of implementation, a decentralized controller design with a limited communication structure is proposed (in this case, a wireless communication link with the nearest preceding vehicle only). A necessary and sufficient frequency-domain condition for string stability is derived, taking into account heterogeneous traffic, i.e., vehicles with possibly different characteristics. For a velocity-dependent intervehicle spacing policy, it is shown that the wireless communication link enables driving at small intervehicle distances while string stability is guaranteed. For a constant, velocity-independent intervehicle spacing, string stability cannot be guaranteed. To validate the theoretical results, experiments are performed with two CACC-equipped vehicles. Implementation of the CACC system, the string-stability characteristics of the practical setup, and experimental results are discussed, indicating the advantages of the design over standard adaptive-cruise-control functionality.
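
String stability is typically assessed in the frequency domain by requiring that disturbances are not amplified from one vehicle to the next along the platoon. As a hedged illustration only (our notation; the paper derives its own necessary and sufficient condition for heterogeneous traffic), a condition of the following form is commonly used, where v̂_i denotes the transform of the i-th vehicle's velocity:

```latex
% Illustrative string-stability condition (notation ours): the
% velocity-propagation transfer function from vehicle i-1 to vehicle i
% must not amplify any frequency component.
\[
  \bigl|\Gamma_i(j\omega)\bigr|
  \;=\;
  \left|\frac{\hat{v}_i(j\omega)}{\hat{v}_{i-1}(j\omega)}\right|
  \;\le\; 1
  \qquad \forall\,\omega \ge 0,\ \forall\, i .
\]
```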

779 citations


Proceedings ArticleDOI
16 May 2010
TL;DR: This paper builds an automatic end-to-end tool, Kudzu, applies it to the problem of finding client-side code injection vulnerabilities, and designs a new language of string constraints together with a solver for it.
Abstract: As AJAX applications gain popularity, client-side JavaScript code is becoming increasingly complex. However, few automated vulnerability analysis tools for JavaScript exist. In this paper, we describe the first system for exploring the execution space of JavaScript code using symbolic execution. To handle JavaScript code’s complex use of string operations, we design a new language of string constraints and implement a solver for it. We build an automatic end-to-end tool, Kudzu, and apply it to the problem of finding client-side code injection vulnerabilities. In experiments on 18 live web applications, Kudzu automatically discovers 2 previously unknown vulnerabilities and 9 more that were previously found only with a manually-constructed test suite.

474 citations


Book ChapterDOI
05 Dec 2010
TL;DR: This work constructs non-interactive zero-knowledge arguments for circuit satisfiability with perfect completeness, perfect zero-knowledge, and computational soundness; security is based on two new cryptographic assumptions.
Abstract: We construct non-interactive zero-knowledge arguments for circuit satisfiability with perfect completeness, perfect zero-knowledge and computational soundness. The non-interactive zero-knowledge arguments have sub-linear size and very efficient public verification. The size of the non-interactive zero-knowledge arguments can even be reduced to a constant number of group elements if we allow the common reference string to be large. Our constructions rely on groups with pairings and security is based on two new cryptographic assumptions; we do not use the Fiat-Shamir heuristic or random oracles.

457 citations


Book ChapterDOI
11 Oct 2010
TL;DR: This paper uses Lyndon words and introduces the Lyndon structure of runs as a useful tool when computing powers, and presents an efficient algorithm for testing primitivity of factors of a string and computing their primitive roots.
Abstract: A breakthrough in the field of text algorithms was the discovery of the fact that the maximal number of runs in a string of length n is O(n) and that they can all be computed in O(n) time. We study some applications of this result. New simpler O(n) time algorithms are presented for a few classical string problems: computing all distinct kth string powers for a given k, in particular squares for k = 2, and finding all local periods in a given string of length n. Additionally, we present an efficient algorithm for testing primitivity of factors of a string and computing their primitive roots. Applications of runs, despite their importance, are underrepresented in existing literature (approximately one page in the paper of Kolpakov & Kucherov, 1999). In this paper we attempt to fill in this gap. We use Lyndon words and introduce the Lyndon structure of runs as a useful tool when computing powers. In problems related to periods we use some versions of the Manhattan skyline problem.
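
For intuition, primitivity of a string can be tested with the classical doubling trick, and the primitive root can be found by checking divisors of the length. The sketch below is a naive illustration only (quadratic in the worst case, with names chosen by us), not the efficient runs-based algorithm described in the paper:

```python
def is_primitive(s: str) -> bool:
    # Classical "doubling" test: s is primitive iff its only occurrences
    # in s+s start at positions 0 and len(s).
    return (s + s).find(s, 1) == len(s)

def primitive_root(s: str) -> str:
    # Shortest string u with s == u repeated k times for some k >= 1,
    # found by trying every divisor of len(s). Illustration only.
    n = len(s)
    for d in range(1, n + 1):
        if n % d == 0 and s[:d] * (n // d) == s:
            return s[:d]
    return s

assert is_primitive("abab") is False and primitive_root("abab") == "ab"
```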

439 citations


Journal ArticleDOI
TL;DR: The algorithms presented here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly.
Abstract: Motivation: Sequence assembly is a difficult problem whose importance has grown again recently as the cost of sequencing has dramatically dropped. Most new sequence assembly software has started by building a de Bruijn graph, avoiding the overlap-based methods used previously because of the computational cost and complexity of these with very large numbers of short reads. Here, we show how to use suffix array-based methods that have formed the basis of recent very fast sequence mapping algorithms to find overlaps and generate assembly string graphs asymptotically faster than previously described algorithms. Results: Standard overlap assembly methods have time complexity O(N^2), where N is the sum of the lengths of the reads. We use the Ferragina–Manzini index (FM-index) derived from the Burrows–Wheeler transform to find overlaps of length at least τ among a set of reads. As well as an approach that finds all overlaps then implements transitive reduction to produce a string graph, we show how to output directly only the irreducible overlaps, significantly shrinking memory requirements and reducing compute time to O(N), independent of depth. Overlap-based assembly methods naturally handle mixed-length read sets, including capillary reads or long reads promised by the third-generation sequencing technologies. The algorithms we present here pave the way for overlap-based assembly approaches to be developed that scale to whole vertebrate genome de novo assembly. Contact: js18@sanger.ac.uk
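
As a point of reference, the brute-force way to find exact suffix-prefix overlaps of length at least τ is sketched below; it does quadratic (or worse) work over the read set, which is exactly the cost the FM-index-based approach in the paper avoids. The function and parameter names are ours, for illustration only:

```python
def overlaps(reads, tau):
    # Naive baseline: report (i, j, length) whenever a suffix of reads[i]
    # of length >= tau equals a prefix of reads[j]. Only the longest
    # overlap per ordered pair is kept. The paper achieves this with an
    # FM-index in O(N) overall; this version is for small inputs only.
    out = []
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            max_len = min(len(a), len(b))
            for L in range(max_len, tau - 1, -1):
                if a[-L:] == b[:L]:
                    out.append((i, j, L))
                    break
    return out

print(overlaps(["ACGTAC", "TACGGA", "GGATTT"], 3))
# [(0, 1, 3), (1, 2, 3)]
```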

267 citations


Journal Article
TL;DR: An upper bound of 0.5 n on the maximal number of cubic runs in a string of length n is shown in this article, together with an infinite family of binary words establishing a lower bound of 0.406 n.
Abstract: A run is an inclusion maximal occurrence in a string (as a subinterval) of a repetition v with a period p such that 2p≤|v|. The maximal number of runs in a string of length n has been thoroughly studied, and is known to be between 0.944 n and 1.029 n. In this paper we investigate cubic runs, in which the shortest period p satisfies 3p≤|v|. We show the upper bound of 0.5 n on the maximal number of such runs in a string of length n, and construct an infinite sequence of words over binary alphabet for which the lower bound is 0.406 n.
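
To make the definition concrete, the brute-force sketch below (ours, suitable only for very short strings) enumerates maximal repetitions whose shortest period p satisfies factor*p <= |v|; factor=2 gives ordinary runs and factor=3 gives the cubic runs studied in the paper:

```python
def smallest_period(w):
    # Smallest p such that w[k] == w[k-p] for all k >= p.
    n = len(w)
    for p in range(1, n + 1):
        if all(w[k] == w[k - p] for k in range(p, n)):
            return p
    return n

def cubic_runs(s, factor=3):
    # Brute-force enumeration of factors v = s[i:j] whose smallest period p
    # satisfies factor*p <= |v| and which are maximal, i.e. cannot be
    # extended left or right without breaking the period p.
    n, out = len(s), []
    for i in range(n):
        for j in range(i + 1, n + 1):
            v = s[i:j]
            p = smallest_period(v)
            if factor * p > len(v):
                continue
            left_ok = i == 0 or s[i - 1] != s[i - 1 + p]
            right_ok = j == n or s[j] != s[j - p]
            if left_ok and right_ok:
                out.append((i, j, p))
    return out

print(cubic_runs("aaaa"))   # [(0, 4, 1)] -- 'aaaa' has period 1 and 3*1 <= 4
```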

266 citations


Patent
27 Oct 2010
TL;DR: In this article, the pre-processing may include determining whether the plurality of values meets a randomness condition, a length condition, and/or a string ratio condition; the candidate data subset is inspected for computer instructions, characteristics of the computer instructions are determined, and a predetermined action is taken based on those characteristics.
Abstract: Detecting executable machine instructions in data is accomplished by accessing a plurality of values representing data contained within a memory of a computer system and performing pre-processing on the plurality of values to produce a candidate data subset. The pre-processing may include determining whether the plurality of values meets (a) a randomness condition, (b) a length condition, and/or (c) a string ratio condition. The candidate data subset is inspected for computer instructions, characteristics of the computer instructions are determined, and a predetermined action is taken based on the characteristics of the computer instructions.
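
The abstract names three pre-processing conditions (randomness, length, and string ratio) without giving formulas. One hedged interpretation, with entirely hypothetical thresholds and function names, is sketched below: a byte region is kept as a candidate for instruction inspection only if it is long enough, not essentially random (e.g., not compressed or encrypted data), and not dominated by printable text:

```python
import math

def shannon_entropy(data: bytes) -> float:
    # Bits of entropy per byte; near 8.0 suggests random-looking data.
    if not data:
        return 0.0
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def printable_ratio(data: bytes) -> float:
    # Fraction of bytes in the printable ASCII range.
    return sum(32 <= b < 127 for b in data) / len(data) if data else 0.0

def candidate_for_disassembly(data: bytes,
                              min_len=64,
                              max_entropy=7.2,
                              max_print_ratio=0.9) -> bool:
    # Illustrative filters loosely modelled on the three conditions named
    # in the abstract; thresholds are arbitrary placeholders, not values
    # taken from the patent.
    return (len(data) >= min_len
            and shannon_entropy(data) <= max_entropy
            and printable_ratio(data) <= max_print_ratio)
```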

187 citations


Patent
Jae-Hun Jeong, Han-soo Kim, Wonseok Cho, Jae-Hoon Jang, Sunil Shim
02 Feb 2010
TL;DR: In this paper, the first selection line is connected to the at least one pair of first selection transistors of the NAND string, and a plurality of word lines are coupled to the plurality of memory cells.
Abstract: A non-volatile memory device having a vertical structure includes a NAND string having a vertical structure. The NAND string includes a plurality of memory cells, and at least one pair of first selection transistors arranged to be adjacent to a first end of the plurality of memory cells. A plurality of word lines are coupled to the plurality of memory cells of the NAND string. A first selection line is commonly connected to the at least one pair of first selection transistors of the NAND string.

182 citations


Book ChapterDOI
20 Mar 2010
TL;DR: Stranger can automatically prove that an application is free from specified attacks or generate vulnerability signatures that characterize all malicious inputs that can be used to generate attacks.
Abstract: Stranger is an automata-based string analysis tool for finding and eliminating string-related security vulnerabilities in PHP applications. Stranger uses symbolic forward and backward reachability analyses to compute the possible values that the string expressions can take during program execution. Stranger can automatically (1) prove that an application is free from specified attacks or (2) generate vulnerability signatures that characterize all malicious inputs that can be used to generate attacks.

152 citations


Posted Content
TL;DR: In this paper, it is shown that any boolean function can be evaluated optimally by a quantum query algorithm that alternates a fixed, input-independent reflection with a second reflection that coherently queries the input string.
Abstract: We show that any boolean function can be evaluated optimally by a quantum query algorithm that alternates a certain fixed, input-independent reflection with a second reflection that coherently queries the input string. Originally introduced for solving the unstructured search problem, this two-reflections structure is therefore a universal feature of quantum algorithms. Our proof goes via the general adversary bound, a semi-definite program (SDP) that lower-bounds the quantum query complexity of a function. By a quantum algorithm for evaluating span programs, this lower bound is known to be tight up to a sub-logarithmic factor. The extra factor comes from converting a continuous-time query algorithm into a discrete-query algorithm. We give a direct and simplified quantum algorithm based on the dual SDP, with a bounded-error query complexity that matches the general adversary bound. Therefore, the general adversary lower bound is tight; it is in fact an SDP for quantum query complexity. This implies that the quantum query complexity of the composition f(g,...,g) of two boolean functions f and g matches the product of the query complexities of f and g, without a logarithmic factor for error reduction. It further shows that span programs are equivalent to quantum query algorithms.
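
In symbols (our notation, paraphrasing the claims in the abstract rather than deriving them), with Adv± the general adversary bound and Q the bounded-error quantum query complexity of a boolean function:

```latex
% Tightness of the general adversary bound and the resulting
% composition property for boolean functions (paraphrased).
\[
  Q(h) \;=\; \Theta\!\bigl(\mathrm{Adv}^{\pm}(h)\bigr),
  \qquad
  \mathrm{Adv}^{\pm}\!\bigl(f \circ (g,\dots,g)\bigr)
  \;=\; \mathrm{Adv}^{\pm}(f)\,\mathrm{Adv}^{\pm}(g),
\]
\[
  \text{and hence}\qquad
  Q\bigl(f \circ (g,\dots,g)\bigr) \;=\; \Theta\!\bigl(Q(f)\,Q(g)\bigr).
\]
```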

148 citations


Journal ArticleDOI
01 Sep 2010
TL;DR: This paper proposes a novel framework called trie-join, which uses a trie structure to index the strings and utilizes the trie structure to efficiently find similar string pairs based on subtrie pruning.
Abstract: A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and have the following disadvantages: (1) They are inefficient for the data sets with short strings (the average string length is no larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel framework called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find the similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on three real data sets with short strings.
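
The trie-join algorithm itself joins two tries against each other; as a simpler, hedged illustration of the underlying subtrie-pruning idea, the sketch below indexes one string collection in a trie and answers a single edit-distance query, pruning an entire subtrie as soon as every cell of the dynamic-programming row exceeds the threshold tau (all names are ours, not the paper's):

```python
class TrieNode:
    __slots__ = ("children", "word")
    def __init__(self):
        self.children = {}
        self.word = None          # set to the full string at leaf positions

def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.word = s
    return root

def search(root, query, tau):
    # One edit-distance DP row per trie node; a whole subtrie is pruned
    # as soon as the minimum of the row exceeds tau.
    results = []
    first_row = list(range(len(query) + 1))

    def recurse(node, ch, prev_row):
        row = [prev_row[0] + 1]
        for i, qc in enumerate(query, 1):
            row.append(min(row[i - 1] + 1,          # insertion
                           prev_row[i] + 1,         # deletion
                           prev_row[i - 1] + (qc != ch)))  # substitution/match
        if node.word is not None and row[-1] <= tau:
            results.append((node.word, row[-1]))
        if min(row) <= tau:                          # subtrie pruning
            for c, child in node.children.items():
                recurse(child, c, row)

    for c, child in root.children.items():
        recurse(child, c, first_row)
    return results

root = build_trie(["string", "strong", "strung", "spring"])
print(search(root, "strinh", 1))   # [('string', 1)]
```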

Proceedings ArticleDOI
01 Jan 2010
TL;DR: This work focuses on streaming transducers processing strings over finite alphabets, given the existence of a robust and well-studied class of "regular" transductions for this case, and finds that the expressiveness of streaming string transducers coincides exactly with this class of regular transductions.
Abstract: Streaming string transducers define (partial) functions from input strings to output strings. A streaming string transducer makes a single pass through the input string and uses a finite set of variables that range over strings from the output alphabet. At every step, the transducer processes an input symbol, and updates all the variables in parallel using assignments whose right-hand-sides are concatenations of output symbols and variables with the restriction that a variable can be used at most once in a right-hand-side expression. It has been shown that streaming string transducers operating on strings over infinite data domains are of interest in algorithmic verification of list-processing programs, as they lead to Pspace decision procedures for checking pre/postconditions and for checking semantic equivalence, for a well-defined class of heap-manipulating programs. In order to understand the theoretical expressiveness of streaming transducers, we focus on streaming transducers processing strings over finite alphabets, given the existence of a robust and well-studied class of "regular" transductions for this case. Such regular transductions can be defined either by two-way deterministic finite-state transducers, or using a logical MSO-based characterization. Our main result is that the expressiveness of streaming string transducers coincides exactly with this class of regular transductions.
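
A concrete toy instance of such a transducer, with two string variables and copyless parallel updates (each variable appears at most once across all right-hand sides of an update), is sketched below in Python. It realizes the regular transduction w -> w . reverse(w), which a two-way transducer can compute but a plain one-way finite-state transducer cannot; the function name is ours:

```python
def palindromize_sst(word: str) -> str:
    # Two string variables with copyless parallel updates:
    #   on reading symbol a:  X := X . a   (the prefix read so far)
    #                         Y := a . Y   (its reverse)
    # Output function at end of input: F = X . Y, i.e. w . reverse(w).
    X, Y = "", ""
    for a in word:
        X, Y = X + a, a + Y   # each variable occurs exactly once on the RHSs
    return X + Y

assert palindromize_sst("abc") == "abccba"
```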

Patent
21 May 2010
TL;DR: In this paper, a 3D memory device is described which includes bottom and top memory cubes having respective arrays of vertical NAND string structures, and a common source plane comprising a layer of conductive material is between the top and bottom memory cubes.
Abstract: A 3D memory device is described which includes bottom and top memory cubes having respective arrays of vertical NAND string structures. A common source plane comprising a layer of conductive material is between the top and bottom memory cubes. The source plane is supplied a bias voltage such as ground, and is selectively coupled to an end of the vertical NAND string structures of the bottom and top memory cubes. Memory cells in a particular memory cube are read using current through the particular vertical NAND string between the source plane and a corresponding bit line coupled to another end of the particular vertical NAND string.

Proceedings ArticleDOI
29 Jul 2010
TL;DR: The design of a CACC system and corresponding experiments are presented and it is shown that the available wireless information enables small inter-vehicle distances, while maintaining string stable behavior.
Abstract: The design of a CACC system and corresponding experiments are presented. The design targets string stable system behavior, which is assessed using a frequency-domain-based approach. Following this approach, it is shown that the available wireless information enables small inter-vehicle distances, while maintaining string stable behavior. The theoretical results are validated by experiments with two CACC-equipped vehicles. Measurement results showing string stable as well as string unstable behavior are discussed.

Proceedings ArticleDOI
28 Jan 2010
TL;DR: [Not recoverable from the extracted text.]
Abstract: [Diagram residue only; the extraction preserves just metamodel element names: ResourceDemandingAction, AcquireAction, ExternalCallAction, and ParametricResourceDemand with attributes demand : String and unit : String.]

Posted Content
TL;DR: These results lead to algorithms for assertion checking and for checking functional equivalence of two programs, written possibly in different programming styles, for commonly used routines such as insert, delete, and reverse.
Abstract: We introduce streaming data string transducers that map input data strings to output data strings in a single left-to-right pass in linear time. Data strings are (unbounded) sequences of data values, tagged with symbols from a finite set, over a potentially infinite data domain that supports only the operations of equality and ordering. The transducer uses a finite set of states, a finite set of variables ranging over the data domain, and a finite set of variables ranging over data strings. At every step, it can make decisions based on the next input symbol, updating its state, remembering the input data value in its data variables, and updating data-string variables by concatenating data-string variables and new symbols formed from data variables, while avoiding duplication. We establish that the problems of checking functional equivalence of two streaming transducers, and of checking whether a streaming transducer satisfies pre/post verification conditions specified by streaming acceptors over input/output data-strings, are in PSPACE. We identify a class of imperative and a class of functional programs, manipulating lists of data items, which can be effectively translated to streaming data-string transducers. The imperative programs dynamically modify a singly-linked heap by changing next-pointers of heap-nodes and by adding new nodes. The main restriction specifies how the next-pointers can be used for traversal. We also identify an expressively equivalent fragment of functional programs that traverse a list using syntactically restricted recursive calls. Our results lead to algorithms for assertion checking and for checking functional equivalence of two programs, written possibly in different programming styles, for commonly used routines such as insert, delete, and reverse.

Patent
31 Jul 2010
TL;DR: In this paper, an LED lamp includes a rectifier, an integrated circuit and a string of series-connected LEDs, and each power switch is coupled so that it can separately and selectably short out a corresponding one of several groups of LEDs in the string.
Abstract: An LED lamp includes a rectifier, an integrated circuit and a string of series-connected LEDs. The lamp receives an incoming AC signal such that a rectified version of the signal is present across the LED string. The integrated circuit includes a plurality of power switches. Each power switch is coupled so that it can separately and selectably short out a corresponding one of several groups of LEDs in the string. As the voltage across the string increases the integrated circuit controls the power switches such that the number of LEDs through which current flows increases, whereas as the voltage across the string decreases the integrated circuit controls the power switches such that the number of LEDs through which current flows decreases. LED string current flow is controlled and regulated to provide superior efficiency, reliability, anti-flicker, regulation against line voltage variations, power factor correction, and lamp over-voltage, over-current, and over-temperature protection.

Patent
18 Mar 2010
TL;DR: In this paper, a computer implemented natural language processing method is used to determine sub-components of the sentence string, assigning one or more unique tokens to each determined subcomponent, determining a probability of use that a determined sub-component has one or multiple specific meanings, based on the determined probability for use.
Abstract: A computer implemented natural language processing method, the method including the steps of: analysing a sentence string within textual information to determine sub-components of the sentence string, assigning one or more unique tokens to each determined sub-component, determining a probability of use that a determined sub-component has one or more specific meanings, based on the determined probability of use, creating a valid set of unique tokens that are associated with the sentence string, and linking verb sub-components associated with one or more of the unique tokens in the valid set of unique tokens to a pre-defined limited sub-set of verbs to create an identification tuple that maps onto the sub-set of verbs.

Book ChapterDOI
30 May 2010
TL;DR: In this article, the authors present a family of verifiable random functions which are provably secure for exponentially large input spaces under a noninteractive complexity assumption, without a common reference string.
Abstract: We present a family of verifiable random functions which are provably secure for exponentially-large input spaces under a noninteractive complexity assumption. Prior constructions required either an interactive complexity assumption or one that could tolerate a factor 2^n security loss for n-bit inputs. Our construction is practical and inspired by the pseudorandom functions of Naor and Reingold and the verifiable random functions of Lysyanskaya. Set in a bilinear group, where the Decisional Diffie-Hellman problem is easy to solve, we require the l-Decisional Diffie-Hellman Exponent assumption in the standard model, without a common reference string. Our core idea is to apply a simulation technique where the large space of VRF inputs is collapsed into a small (polynomial-size) input in the view of the reduction algorithm. This view, however, is information-theoretically hidden from the attacker. Since the input space is exponentially large, we can first apply a collision-resistant hash function to handle arbitrarily-large inputs.

Proceedings ArticleDOI
01 Mar 2010
TL;DR: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases based on the R-tree augmented with the min-wise signature and the linear hashing technique.
Abstract: This work presents a novel index structure, MHR-tree, for efficiently answering approximate string match queries in large spatial databases. The MHR-tree is based on the R-tree augmented with the min-wise signature and the linear hashing technique. The min-wise signature for an index node u keeps a concise representation of the union of q-grams from strings under the sub-tree of u. We analyze the pruning functionality of such signatures based on set resemblance between the query string and the q-grams from the sub-trees of index nodes. MHR-tree supports a wide range of query predicates efficiently, including range and nearest neighbor queries. We also discuss how to estimate range query selectivity accurately. We present a novel adaptive algorithm for finding balanced partitions using both the spatial and string information stored in the tree. Extensive experiments on large real data sets demonstrate the efficiency and effectiveness of our approach.
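
The min-wise signature idea can be illustrated independently of the R-tree: each string is reduced to its set of q-grams, each set is summarized by k minimum hash values, and the fraction of agreeing positions between two signatures estimates the set resemblance (Jaccard similarity) used for pruning. The sketch below is a generic MinHash illustration with hypothetical padding and hashing choices, not the MHR-tree construction itself:

```python
import hashlib

def qgrams(s, q=2):
    # Pad with '#' so prefixes/suffixes contribute q-grams (a common choice,
    # assumed here rather than taken from the paper).
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def minwise_signature(items, k=32):
    # k independent hash functions simulated by salting a single hash.
    sig = []
    for seed in range(k):
        sig.append(min(
            int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)
            for x in items))
    return sig

def estimated_resemblance(sig_a, sig_b):
    # The fraction of positions on which two signatures agree estimates the
    # Jaccard similarity of the underlying q-gram sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minwise_signature(qgrams("similarity"))
b = minwise_signature(qgrams("similarly"))
print(estimated_resemblance(a, b))   # rough Jaccard estimate of the q-gram sets
```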

Patent
Jae-Hoon Jang, Jung Dal Choi
29 Mar 2010
TL;DR: In this paper, a programming method of a semiconductor memory device includes charging a channel of an inhibit string to a precharge voltage provided to the common source line and boosting the charged channel by providing a wordline voltage to the cell strings.
Abstract: A programming method of a semiconductor memory device includes charging a channel of an inhibit string to a precharge voltage provided to the common source line and boosting the charged channel by providing a wordline voltage to the cell strings. The inhibit string is connected to a program bitline among the bitlines.

Proceedings Article
23 Aug 2010
TL;DR: This paper presents CPMerge, a simple and efficient algorithm for the τ-overlap join of inverted lists, designed for approximate dictionary matching under similarity measures such as the cosine, Dice, Jaccard, and overlap coefficients.
Abstract: This paper presents a simple and efficient algorithm for approximate dictionary matching designed for similarity measures such as cosine, Dice, Jaccard, and overlap coefficients. We first show that this task can be solved exactly by a τ-overlap join of inverted lists and propose an algorithm, called CPMerge, for computing this join. Given the inverted lists retrieved for a query, the algorithm collects fewer candidate strings and prunes unlikely candidates to efficiently find the strings that satisfy the constraint of the τ-overlap join. We conducted approximate dictionary matching experiments on three large-scale datasets that include person names, biomedical names, and general English words. The algorithm exhibited scalable performance on the datasets. For example, it retrieved strings in 1.1 ms from the string collection of Google Web1T unigrams (with cosine similarity and a threshold of 0.7).
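
The τ-overlap join itself can be stated very simply: report every dictionary string that shares at least τ features (e.g., q-grams) with the query. The naive merge-and-count version over inverted lists is sketched below; CPMerge improves on this by splitting the retrieved lists and pruning candidates early. The names and the q-gram featurization are ours, for illustration only:

```python
from collections import defaultdict

def build_inverted_index(strings, q=2):
    # Map each q-gram to the set of string ids containing it.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in {s[i:i + q] for i in range(len(s) - q + 1)}:
            index[g].add(sid)
    return index

def tau_overlap_join(query, index, tau, q=2):
    # Naive counting variant of the tau-overlap join: a string is reported
    # if it shares at least tau distinct q-grams with the query.
    counts = defaultdict(int)
    for g in {query[i:i + q] for i in range(len(query) - q + 1)}:
        for sid in index.get(g, ()):
            counts[sid] += 1
    return [sid for sid, c in counts.items() if c >= tau]

strings = ["johnson", "jonson", "johansson", "smith"]
idx = build_inverted_index(strings)
print(tau_overlap_join("jonhson", idx, tau=4))   # candidate ids sharing >= 4 bigrams
```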

Posted Content
TL;DR: Two representations of a string of length N compressed into a context-free grammar of size n are presented, achieving O(log N) random access time, together with several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
Abstract: Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot \alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k^{th}$ row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompress any substring of length $m$ in the same complexity as a single random access query and additional $O(m)$ time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern $P$ with at most $k$ errors in time $O(n(\min\{|P|k, k^4 + |P|\} + \log N) + occ)$, where $occ$ is the number of occurrences of $P$ in $S$. Finally, we generalize our results to navigation and other operations on grammar-compressed ordered trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
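
For contrast with the paper's O(log N) representation, the obvious baseline for random access in a grammar-compressed string stores the expansion length of every nonterminal and walks down the derivation, which costs time proportional to the grammar's height. A minimal sketch of this baseline follows, with our own encoding of the straight-line program as a Python dict (terminal rules map to a single character, binary rules to a pair of symbols):

```python
def slp_lengths(grammar):
    # Expansion length of every symbol, computed by memoized recursion.
    memo = {}
    def length(sym):
        if sym not in memo:
            rule = grammar[sym]
            memo[sym] = 1 if isinstance(rule, str) else length(rule[0]) + length(rule[1])
        return memo[sym]
    for sym in grammar:
        length(sym)
    return memo

def slp_access(grammar, lengths, start, i):
    # Naive random access to position i (0-based) of the generated string:
    # descend from the start symbol, choosing the left or right child by
    # comparing i with the left child's expansion length. O(height) per
    # query, versus the paper's O(log N) regardless of grammar shape.
    sym = start
    while not isinstance(grammar[sym], str):
        left, right = grammar[sym]
        if i < lengths[left]:
            sym = left
        else:
            sym, i = right, i - lengths[left]
    return grammar[sym]

# Example: S -> AB, A -> 'a', B -> AC, C -> 'b' generates "aab".
g = {"S": ("A", "B"), "A": "a", "B": ("A", "C"), "C": "b"}
L = slp_lengths(g)
assert "".join(slp_access(g, L, "S", k) for k in range(L["S"])) == "aab"
```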

Journal ArticleDOI
01 Jun 2010
TL;DR: A robust context integration model for on-line handwritten Japanese text recognition based on string class probability approximation that can flexibly combine the scores of various contexts and is insensitive to the variability in path length is described.
Abstract: This paper describes a robust context integration model for on-line handwritten Japanese text recognition. Based on string class probability approximation, the proposed method evaluates the likelihood of candidate segmentation–recognition paths by combining the scores of character recognition, unary and binary geometric features, as well as linguistic context. The path evaluation criterion can flexibly combine the scores of various contexts and is insensitive to the variability in path length, and so the optimal segmentation path with its string class can be effectively found by Viterbi search. Moreover, the model parameters are estimated by the genetic algorithm so as to optimize the holistic string recognition performance. In experiments on horizontal text lines extracted from the TUAT Kondate database, the proposed method achieves a segmentation rate of 0.9934, measured as an f-measure, and a character recognition rate of 92.80%.

Proceedings Article
02 Jun 2010
TL;DR: This work includes joint n-gram features inside a state-of-the-art discriminative sequence model for letter-to-phoneme and transliteration transduction, and the results indicate an improvement in overall performance.
Abstract: Phonetic string transduction problems, such as letter-to-phoneme conversion and name transliteration, have recently received much attention in the NLP community. In the past few years, two methods have come to dominate as solutions to supervised string transduction: generative joint n-gram models, and discriminative sequence models. Both approaches benefit from their ability to consider large, flexible spans of source context when making transduction decisions. However, they encode this context in different ways, providing their respective models with different information. To combine the strengths of these two systems, we include joint n-gram features inside a state-of-the-art discriminative sequence model. We evaluate our approach on several letter-to-phoneme and transliteration data sets. Our results indicate an improvement in overall performance with respect to both the joint n-gram approach and traditional feature sets for discriminative models.

Patent
23 Feb 2010
TL;DR: In this article, a system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a businessdocument thereon, to create a bitmap.
Abstract: A system for electronically distilling information from a business document uses a network scanner to electronically scan a platen area, having a business document thereon, to create a bitmap. A network server carries out a segmentation process to segment the scan-generated bitmap into a bitmap object, the bitmap object corresponding to the scanned business document; a bitmap-to-text conversion process to convert the bitmap object into a block of text; a semantic recognition process to generate a structured representation of semantic entities corresponding to the scanned business document; and a document generation process to convert the structured representation into a structured text file. The semantic recognition process includes the processes of generating, for each line of text having a keyword therein, a terminal symbol corresponding to the keyword therein; generating, for each line of text not having a keyword therein and absent of numeric characters, an alphabetic terminal symbol; generating, for each line of text not having a keyword therein and having a numeric character therein, an alphanumeric terminal symbol; generating a string of terminal symbols from the generated terminal symbols; determining a probable parsing of the generated string of terminal symbols; labeling each text line, according to a determined function, with non-terminal symbols; and parsing the business document information text into fields of business document information text based upon the non-terminal symbol of each text line and the determined probable parsing of the generated string of terminal symbols.
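
A hedged reading of the terminal-symbol generation rules described above (keyword lines get a keyword terminal, digit-free lines an alphabetic terminal, digit-bearing lines an alphanumeric terminal) might look like the sketch below; the keyword set and symbol names are hypothetical and not taken from the patent:

```python
KEYWORDS = {"invoice", "total", "date", "amount"}   # hypothetical keyword set

def terminal_symbol(line: str) -> str:
    # Label a line of text with a terminal symbol following the three rules
    # named in the abstract.
    words = {w.strip(".,:").lower() for w in line.split()}
    hit = words & KEYWORDS
    if hit:
        return "KW_" + sorted(hit)[0].upper()        # keyword terminal
    if any(ch.isdigit() for ch in line):
        return "ALNUM"                               # alphanumeric terminal
    return "ALPHA"                                   # alphabetic terminal

def terminal_string(text: str) -> list:
    # The string of terminal symbols that would be handed to the parser.
    return [terminal_symbol(line) for line in text.splitlines() if line.strip()]

print(terminal_string("Invoice 2010-07\nAcme Corporation\nTotal: 42.50"))
# ['KW_INVOICE', 'ALPHA', 'KW_TOTAL']
```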

Journal ArticleDOI
TL;DR: This paper shows that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches to ensemble clustering.
Abstract: In this paper, we study the ensemble clustering problem, where the input is in the form of multiple clustering solutions. The goal of ensemble clustering algorithms is to aggregate the solutions into one solution that maximizes the agreement in the input ensemble. We obtain several new results for this problem. Specifically, we show that the notion of agreement under such circumstances can be better captured using a 2D string encoding rather than a voting strategy, which is common among existing approaches. Our optimization proceeds by first constructing a non-linear objective function which is then transformed into a 0-1 Semidefinite program (SDP) using novel convexification techniques. This model can be subsequently relaxed to a polynomial time solvable SDP. In addition to the theoretical contributions, our experimental results on standard machine learning and synthetic datasets show that this approach leads to improvements not only in terms of the proposed agreement measure but also the existing agreement measures based on voting strategies. In addition, we identify several new application scenarios for this problem. These include combining multiple image segmentations and generating tissue maps from multiple-channel Diffusion Tensor brain images to identify the underlying structure of the brain.

Patent
06 Aug 2010
TL;DR: In this article, a server system receives a visual query from a client system, which is an image containing text such as a picture of a document, and performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters.
Abstract: A server system receives a visual query from a client system. The visual query is an image containing text such as a picture of a document. At the receiving server or another server, optical character recognition (OCR) is performed on the visual query to produce text recognition data representing textual characters. Each character in a contiguous region of the visual query is individually scored according to its quality. The quality score of a respective character is influenced by the quality scores of neighboring or nearby characters. Using the scores, one or more high quality strings of characters are identified. Each high quality string has a plurality of high quality characters. A canonical document containing the one or more high quality textual strings is retrieved. At least a portion of the canonical document is sent to the client system.

Proceedings Article
11 Jul 2010
TL;DR: This paper presents a general q-gram-based framework and proposes two efficient algorithms, based on the strategies introduced, that show superior performance on the top-k similar string matching problem.
Abstract: Top-k approximate querying on string collections is an important data analysis tool for many applications, and it has been exhaustively studied. However, the scale of the problem has increased dramatically because of the prevalence of the Web. In this paper, we aim to explore the efficient top-k similar string matching problem. Several efficient strategies are introduced, such as length-aware and adaptive q-gram selection. We present a general q-gram-based framework and propose two efficient algorithms based on the strategies introduced. Our techniques are experimentally evaluated on three real data sets and show a superior performance.

Proceedings ArticleDOI
06 Jun 2010
TL;DR: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size which allows the translation to be inferred.
Abstract: A generalization from strings to trees and from languages to translations is given of the classical result that any regular language can be learned from examples: it is shown that for any deterministic top-down tree transformation there exists a sample set of polynomial size (with respect to the minimal transducer) which allows the translation to be inferred. Until now, similar results had been known only for string transducers and for simple relabeling tree transducers. Learning of deterministic top-down tree transducers (dtops) is far more involved because a dtop can copy, delete, and permute its input subtrees. Thus, complex dependencies of labeled input to output paths need to be maintained by the algorithm. First, a Myhill-Nerode theorem is presented for dtops, which is interesting on its own. This theorem is then used to construct a learning algorithm for dtops. Finally, it is shown how our result can be applied to xml transformations (e.g. xslt programs). For this, a new dtd-based encoding of unranked trees by ranked ones is presented. Over such encodings, dtops can realize many practically interesting xml transformations which cannot be realized on first-child/next-sibling encodings.