Showing papers on "String (computer science)" published in 2011

Open Access

Journal Article•DOI•

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers

Guillaume Marçais¹, Carl Kingsford¹•Institutions (1)

01 Mar 2011-Bioinformatics

TL;DR: This work proposes a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient, based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length.

Abstract: Motivation: Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm. Results: We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution. Availability: The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

2,779 citations
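The counting task described in this abstract can be illustrated with a minimal single-threaded sketch. It is not Jellyfish's lock-free hash table: it uses Python's `Counter`, and the 2-bit packing of bases into an integer is an assumption made here to show why a limit of 31 bases fits in one 64-bit word (31 × 2 = 62 bits). The names `count_kmers` and `decode` are hypothetical.

```python
from collections import Counter

# Assumed 2-bit encoding of the four DNA bases (not taken from the abstract).
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every k-mer in `sequence`, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the last k bases
    counts = Counter()
    packed = 0
    valid = 0  # consecutive valid bases seen since the last reset
    for base in sequence.upper():
        code = _CODE.get(base)
        if code is None:  # e.g. 'N': no k-mer may span this position
            packed, valid = 0, 0
            continue
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def decode(packed: int, k: int) -> str:
    """Turn a packed k-mer back into its base string."""
    return "".join("ACGT"[(packed >> (2 * (k - 1 - i))) & 3] for i in range(k))


if __name__ == "__main__":
    for packed, n in sorted(count_kmers("ACGTACGT", 4).items()):
        print(decode(packed, 4), n)
```

A parallel version would have several threads update one shared table, which is the part the paper addresses and this sketch deliberately leaves out.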

Journal Article•DOI•

Automating string processing in spreadsheets using input-output examples

Sumit Gulwani¹•Institutions (1)

Microsoft¹

26 Jan 2011

TL;DR: This work describes the design of a string programming/expression language that supports restricted forms of regular expressions, conditionals and loops, together with an algorithm, based on several novel concepts, for synthesizing a desired program in this language from input-output examples.

Abstract: We describe the design of a string programming/expression language that supports restricted forms of regular expressions, conditionals and loops. The language is expressive enough to represent a wide variety of string manipulation tasks that end-users struggle with. We describe an algorithm based on several novel concepts for synthesizing a desired program in this language from input-output examples. The synthesis algorithm is very efficient, taking a fraction of a second for various benchmark examples. The synthesis algorithm is interactive and has several desirable features: it can rank multiple solutions and has fast convergence, it can detect noise in the user input, and it supports an active interaction model wherein the user is prompted to provide outputs on inputs that may have multiple computational interpretations. The algorithm has been implemented as an interactive add-in for the Microsoft Excel spreadsheet system. The prototype tool has met the golden test: it has synthesized part of itself, and has been used to solve problems beyond the author's imagination.

801 citations

Journal Article•DOI•

Local stabilizer codes in three dimensions without string logical operators

Jeongwan Haah¹•Institutions (1)

California Institute of Technology¹

22 Apr 2011-Physical Review A

TL;DR: It is proved that every stringlike logical operator of this code can be deformed to a disjoint union of short segments, each of which is in the stabilizer group, and a notion of "logical string segments" is introduced to avoid difficulties in defining one-dimensional objects in discrete lattices.

Abstract: We suggest concrete models for self-correcting quantum memory by reporting examples of local stabilizer codes in 3D that have no string logical operators. Previously known local stabilizer codes in 3D all have stringlike logical operators, which make the codes non-self-correcting. We introduce a notion of “logical string segments” to avoid difficulties in defining one-dimensional objects in discrete lattices. We prove that every stringlike logical operator of our code can be deformed to a disjoint union of short segments, each of which is in the stabilizer group. The code has surfacelike logical operators whose partial implementation has unsatisfied stabilizers along its boundary.

610 citations

Patent•

Nonvolatile memory device, operating method thereof and memory system including the same

Chi Weon Yoon¹, Dong-Hyuk Chae¹, Jae-Woo Park¹, Sang-Wan Nam¹•Institutions (1)

Samsung¹

16 Feb 2011

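TL;DR: In this article, a method of operating a non-volatile memory device is described in which an erasing operation is performed on the memory cells associated with a plurality of string selection lines (SSLs), which together constitute a memory block, and the erasure is then verified one SSL at a time: first for the memory cells of a first SSL, then for those of a second SSL.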

Abstract: A method of operating a non-volatile memory device includes performing an erasing operation to memory cells associated with a plurality of string selection lines (SSLs), the memory cells associated with the plurality of SSLs constituting a memory block, and verifying the erasing operation to second memory cells associated with a second SSL after verifying the erasing operation to first memory cells associated with a first SSL.

497 citations

Journal Article•DOI•

Practical String Stability of Platoon of Adaptive Cruise Control Vehicles

Lingyun Xiao¹, Feng Gao²•Institutions (2)

General Administration of Quality Supervision, Inspection and Quarantine¹, Beihang University²

01 Dec 2011-IEEE Transactions on Intelligent Transportation Systems

TL;DR: This paper provides a practical means to evaluate the ACC systems applying the sliding-mode controller and provides a reasonable proposal to design the ACC controller from the perspective of the practical string stability.

Abstract: In this paper, the practical string stability of both homogeneous and heterogeneous platoons of adaptive cruise control (ACC) vehicles, which apply the constant time headway spacing policy, is investigated by considering the parasitic time delays and lags of the actuators and sensors when building the vehicle longitudinal dynamics model. The proposed control law based on the sliding-mode controller can guarantee both homogeneous and heterogeneous string stability, if the control parameters and system parameters meet certain requirements. The analysis of the negative effect of the parasitic time delays and lags on the string stability indicates that the negative effect of the time delays is larger than that of the time lags. This paper provides a practical means to evaluate the ACC systems applying the sliding-mode controller and provides a reasonable proposal to design the ACC controller from the perspective of the practical string stability.

403 citations

Journal Article•DOI•

Private randomness expansion with untrusted devices

Roger Colbeck¹, Adrian Kent¹,²•Institutions (2)

Perimeter Institute for Theoretical Physics¹, University of Cambridge²

04 Mar 2011-Journal of Physics A

TL;DR: This work introduces a protocol for private randomness expansion with untrusted devices which is designed to take as input an initially private random string and produce as output a longer private random string.

Abstract: Randomness is an important resource for many applications, from gambling to secure communication. However, guaranteeing that the output from a candidate random source could not have been predicted by an outside party is a challenging task, and many supposedly random sources used today provide no such guarantee. Quantum solutions to this problem exist, for example a device which internally sends a photon through a beamsplitter and observes on which side it emerges, but, presently, such solutions require the user to trust the internal workings of the device. Here, we seek to go beyond this limitation by asking whether randomness can be generated using untrusted devices—even ones created by an adversarial agent—while providing a guarantee that no outside party (including the agent) can predict it. Since this is easily seen to be impossible unless the user has an initially private random string, the task we investigate here is private randomness expansion. We introduce a protocol for private randomness expansion with untrusted devices which is designed to take as input an initially private random string and produce as output a longer private random string. We point out that private randomness expansion protocols are generally vulnerable to attacks that can render the initial string partially insecure, even though that string is used only inside a secure laboratory; our protocol is designed to remove this previously unconsidered vulnerability by privacy amplification. We also discuss extensions of our protocol designed to generate an arbitrarily long random string from a finite initially private random string. The security of these protocols against the most general attacks is left as an open question.

348 citations

Proceedings Article•DOI•

Fast-join: An efficient method for fuzzy token matching based string similarity join

Jiannan Wang¹, Guoliang Li¹, Jianhua Feng¹•Institutions (1)

Tsinghua University¹

11 Apr 2011

TL;DR: This paper proposes a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions by allowing fuzzy matches between two tokens; the approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.

Abstract: String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.

137 citations
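The idea of extending a token-based measure such as Jaccard with fuzzy token matches can be sketched as follows. The abstract does not give the exact definition, so everything here is an assumption: tokens are compared by normalized edit similarity, pairs below a threshold `delta` are ignored, and tokens are paired greedily rather than by an optimal matching. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_jaccard(s: str, t: str, delta: float = 0.8) -> float:
    """Jaccard-style score where near-identical tokens contribute partially."""
    xs, ys = s.split(), t.split()
    if not xs and not ys:
        return 1.0
    # All candidate token pairs that are similar enough, best first.
    pairs = sorted(
        ((edit_similarity(x, y), i, j)
         for i, x in enumerate(xs) for j, y in enumerate(ys)),
        reverse=True,
    )
    used_x, used_y, overlap = set(), set(), 0.0
    for sim, i, j in pairs:
        if sim < delta:
            break
        if i not in used_x and j not in used_y:  # greedy one-to-one pairing
            used_x.add(i)
            used_y.add(j)
            overlap += sim
    return overlap / (len(xs) + len(ys) - overlap)


if __name__ == "__main__":
    print(fuzzy_jaccard("nba macgrady", "nba mcgrady"))
```

With exact token matching the two strings in the usage line share only "nba" (Jaccard 1/3); allowing the near match between "macgrady" and "mcgrady" raises the score, which is the effect the abstract describes.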

Patent•

Memory architecture of 3D array with alternating memory string orientation and string select structures

Shih-Hung Chen¹, Hang-Ting Lue¹•Institutions (1)

National Tsing Hua University¹

01 Apr 2011

TL;DR: In this article, a 3D memory device includes a plurality of ridge-shaped stacks, in the form of multiple strips of conductive material separated by insulating material, arranged as bit lines which can be coupled through decoding circuits to sense amplifiers.

Abstract: A 3D memory device includes a plurality of ridge-shaped stacks, in the form of multiple strips of conductive material separated by insulating material, arranged as bit lines which can be coupled through decoding circuits to sense amplifiers. Diodes are connected to the bit lines at either the string select or common source select ends of the strings. The strips of conductive material have side surfaces on the sides of the ridge-shaped stacks. A plurality of word lines, which can be coupled to row decoders, extends orthogonally over the plurality of ridge-shaped stacks. Memory elements lie in a multi-layer array of interface regions at cross-points between side surfaces of the semiconductor strips on the stacks and the word lines.

119 citations

Journal Article•DOI•

A Memory-Efficient Bit-Split Parallel String Matching Using Pattern Dividing for Intrusion Detection Systems

Jin Xu¹, Albert Y. S. Lam², Victor O. K. Li¹•Institutions (2)

University of Hong Kong¹, University of California, Berkeley²

01 Oct 2011-IEEE Transactions on Parallel and Distributed Systems

TL;DR: In this article, a memory-efficient parallel string matching scheme is proposed for low-cost hardware-based intrusion detection systems, where long target patterns are divided into sub-patterns with a fixed length.

Abstract: For the low-cost hardware-based intrusion detection systems, this paper proposes a memory-efficient parallel string matching scheme. In order to reduce the number of state transitions, the finite state machine tiles in a string matcher adopt bit-level input symbols. Long target patterns are divided into subpatterns with a fixed length; deterministic finite automata are built with the subpatterns. Using the pattern dividing, the variety of target pattern lengths can be mitigated, so that memory usage in homogeneous string matchers can be efficient. In order to identify each original long pattern being divided, a two-stage sequential matching scheme is proposed for the successive matches with subpatterns. Experimental results show that total memory requirements decrease on average by 47.8 percent and 62.8 percent for Snort and ClamAV rule sets, in comparison with several existing bit-split string matching methods.

114 citations

Proceedings Article•

Bootstrapped Named Entity Recognition for Product Attribute Extraction

Duangmanee Putthividhya¹, Junling Hu¹•Institutions (1)

eBay¹

27 Jul 2011

TL;DR: Focusing on listings from eBay's clothing and shoes categories, the bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as identifying novel brands.

Abstract: We present a named entity recognition (NER) system for extracting product attributes and values from listing titles. Information extraction from short listing titles presents a unique challenge, with the lack of informative context and grammatical structure. In this work, we combine supervised NER with bootstrapping to expand the seed list, and output normalized results. Focusing on listings from eBay's clothing and shoes categories, our bootstrapped NER system is able to identify new brands corresponding to spelling variants and typographical errors of the known brands, as well as identifying novel brands. Among the top 300 new brands predicted, our system achieves 90.33% precision. To output normalized attribute values, we explored several string comparison algorithms and found n-gram substring matching to work well in practice.

110 citations
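The normalization step mentioned at the end of this abstract, mapping spelling variants onto known brands by n-gram matching, might look roughly like the sketch below. The abstract gives no details, so the choice of character trigrams, Jaccard overlap, the 0.5 threshold, and the names `char_ngrams`, `ngram_similarity` and `normalize` are all assumptions for illustration.

```python
from typing import Iterable, Optional, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Set of lower-cased character n-grams; short strings yield themselves."""
    text = text.lower()
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def normalize(candidate: str, known: Iterable[str],
              n: int = 3, threshold: float = 0.5) -> Optional[str]:
    """Return the closest known value, or None if nothing is close enough."""
    best, best_score = None, threshold
    for value in known:
        score = ngram_similarity(candidate, value, n)
        if score >= best_score:
            best, best_score = value, score
    return best


if __name__ == "__main__":
    brands = ["Abercrombie & Fitch", "Nike"]
    print(normalize("Abercrombie & Fich", brands))
```

Returning `None` below the threshold leaves room for the other case the abstract mentions, a genuinely novel brand that should not be folded into an existing one.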

Patent•

Interface for deploying wireline tools with non-electric string

Kevin L. Gray¹•Institutions (1)

Weatherford International¹

01 Feb 2011

TL;DR: In this article, a method of determining a free point of a tubular string stuck in a wellbore includes deploying a tool string in the stuck tubular with a non-electric string.

Abstract: Embodiments of the present invention generally relate to a method and/or apparatus for deploying wireline tools with a non-electric string. In one embodiment, a method of determining a free point of a tubular string stuck in a wellbore includes deploying a tool string in the stuck tubular with a non-electric string. The free point assembly includes a battery, a controller, and a free point tool. The method further includes activating the free point tool by the controller. The free point tool contacts an inner surface of the stuck tubular string. The method further includes applying a tensile force and/or torque to the stuck tubular string; and measuring a response of the tubular string with the free point tool.

Proceedings Article•DOI•

Impact of packet loss on CACC string stability performance

C. Lei¹, E. M. van Eenennaam¹, W. Klein Wolterink¹, Georgios Karagiannis¹, Geert Heijenk¹, Jeroen Ploeg•Institutions (1)

University of Twente¹

24 Aug 2011

TL;DR: The string stability of CACC is discussed and its performance with various packet loss ratios, beacon sending frequencies and time headway in simulations is evaluated.

Abstract: Recent development in wireless technology enables communication between vehicles. The concept of Co-operative Adaptive Cruise Control (CACC) — which uses wireless communication between vehicles — aims at string stable behaviour in a platoon of vehicles. “String stability” means any non-zero position, speed, and acceleration errors of an individual vehicle in a string do not amplify when they propagate upstream. In this paper, we will discuss the string stability of CACC and evaluate its performance with various packet loss ratios, beacon sending frequencies and time headway in simulations. The simulation framework is built up with a controller prototype, a traffic simulator, and a network simulator.

Journal Article•DOI•

Fully compressed suffix trees

Luís M. S. Russo¹, Gonzalo Navarro², Arlindo L. Oliveira¹•Institutions (2)

Technical University of Lisbon¹, University of Chile²

28 Sep 2011-ACM Transactions on Algorithms

TL;DR: This article introduces the first compressed suffix tree representation that requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time.

Abstract: Suffix trees are by far the most important data structure in stringology, with a myriad of applications in fields like bioinformatics and information retrieval. Classical representations of suffix trees require Θ(n log n) bits of space, for a string of size n. This is considerably more than the n log₂ σ bits needed for the string itself, where σ is the alphabet size. The size of suffix trees has been a barrier to their wider adoption in practice. Recent compressed suffix tree representations require just the space of the compressed string plus Θ(n) extra bits. This is already spectacular, but the linear extra bits are still unsatisfactory when σ is small as in DNA sequences. In this article, we introduce the first compressed suffix tree representation that breaks this Θ(n)-bit space barrier. The Fully Compressed Suffix Tree (FCST) representation requires only sublinear space on top of the compressed text size, and supports a wide set of navigational operations in almost logarithmic time. This includes extracting arbitrary text substrings, so the FCST replaces the text using almost the same space as the compressed text. An essential ingredient of FCSTs is the lowest common ancestor (LCA) operation. We reveal important connections between LCAs and suffix tree navigation. We also describe how to make FCSTs dynamic, that is, support updates to the text. The dynamic FCST also supports several operations. In particular, it can build the static FCST within optimal space and polylogarithmic time per symbol. Our theoretical results are also validated experimentally, showing that FCSTs are very effective in practice as well.
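The gap between the Θ(n log n) bits of a classical suffix tree and the n log₂ σ bits of the text can be made concrete with a short calculation. The sketch below takes the constant hidden in the Θ as 1, which is an assumption (real pointer-based suffix trees use a larger constant), and the sizes chosen are illustrative only.

```python
import math


def classical_suffix_tree_bits(n: int) -> float:
    """n * log2(n): the Theta(n log n) bound with the hidden constant set to 1."""
    return n * math.log2(n)


def plain_text_bits(n: int, sigma: int) -> float:
    """n * log2(sigma): bits needed to store the string itself."""
    return n * math.log2(sigma)


# Illustrative case: a DNA-like alphabet (sigma = 4) and n = 2^30 symbols.
n, sigma = 2**30, 4
ratio = classical_suffix_tree_bits(n) / plain_text_bits(n, sigma)
print(ratio)  # 15.0
```

Even under this optimistic constant the tree is 15 times larger than a text of about one billion DNA symbols, which is the small-σ situation the abstract singles out.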

Proceedings Article•

Discovering Morphological Paradigms from Plain Text Using a Dirichlet Process Mixture Model

Markus Dreyer, Jason Eisner¹•Institutions (1)

Johns Hopkins University¹

27 Jul 2011

TL;DR: An inference algorithm is presented that organizes observed words (tokens) into structured inflectional paradigms (types), naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words.

Abstract: We present an inference algorithm that organizes observed words (tokens) into structured inflectional paradigms (types). It also naturally predicts the spelling of unobserved forms that are missing from these paradigms, and discovers inflectional principles (grammar) that generalize to wholly unobserved words. Our Bayesian generative model of the data explicitly represents tokens, types, inflections, paradigms, and locally conditioned string edits. It assumes that inflected word tokens are generated from an infinite mixture of inflectional paradigms (string tuples). Each paradigm is sampled all at once from a graphical model, whose potential functions are weighted finite-state transducers with language-specific parameters to be learned. These assumptions naturally lead to an elegant empirical Bayes inference procedure that exploits Monte Carlo EM, belief propagation, and dynamic programming. Given 50–100 seed paradigms, adding a 10-million-word corpus reduces prediction error for morphological inflections by up to 10%.

Journal Article•DOI•

Streaming transducers for algorithmic verification of single-pass list-processing programs

Rajeev Alur¹, Pavol Černý²•Institutions (2)

University of Pennsylvania¹, Institute of Science and Technology Austria²

26 Jan 2011

TL;DR: In this article, the authors introduce streaming data string transducers that map input data strings to output data strings in a single left-to-right pass in linear time, and establish PSPACE bounds for the problems of checking functional equivalence of two streaming transducers, and of checking whether a streaming transducer satisfies pre/post verification conditions specified by streaming acceptors over input/output data-strings.

Abstract: We introduce streaming data string transducers that map input data strings to output data strings in a single left-to-right pass in linear time. Data strings are (unbounded) sequences of data values, tagged with symbols from a finite set, over a potentially infinite data domain that supports only the operations of equality and ordering. The transducer uses a finite set of states, a finite set of variables ranging over the data domain, and a finite set of variables ranging over data strings. At every step, it can make decisions based on the next input symbol, updating its state, remembering the input data value in its data variables, and updating data-string variables by concatenating data-string variables and new symbols formed from data variables, while avoiding duplication. We establish PSPACE bounds for the problems of checking functional equivalence of two streaming transducers, and of checking whether a streaming transducer satisfies pre/post verification conditions specified by streaming acceptors over input/output data-strings. We identify a class of imperative and a class of functional programs, manipulating lists of data items, which can be effectively translated to streaming data-string transducers. The imperative programs dynamically modify a singly-linked heap by changing next-pointers of heap-nodes and by adding new nodes. The main restriction specifies how the next-pointers can be used for traversal. We also identify an expressively equivalent fragment of functional programs that traverse a list using syntactically restricted recursive calls. Our results lead to algorithms for assertion checking and for checking functional equivalence of two programs, written possibly in different programming styles, for commonly used routines such as insert, delete, and reverse.

Proceedings Article•DOI•

Random access to grammar-compressed strings

Philip Bille¹, Gad M. Landau², Rajeev Raman³, Kunihiko Sadakane⁴, Srinivasa Rao Satti⁵, Oren Weimann⁶•Institutions (6)

Technical University of Denmark¹, University of Haifa², University of Leicester³, National Institute of Informatics⁴, Seoul National University⁵, Weizmann Institute of Science⁶

23 Jan 2011

TL;DR: In this paper, the authors present two representations of a string of length N compressed into a context-free grammar of size n, both achieving O(log N) random access time, with either O(n · αₖ(n)) construction time and space on the pointer machine model or O(n) construction time and space on the RAM.

Abstract: Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αₖ(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αₖ(n) is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query and additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy-paths in grammars.
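Random access to a grammar-compressed string can be illustrated with the simple baseline that the paper improves on. The sketch assumes a grammar in which every rule is either a single character or a pair of symbols (a straight-line program), stores the expansion length of each symbol, and walks down from the start symbol. This takes time proportional to the height of the grammar, not the O(log N) bound of the paper; the rule format and function names are hypothetical.

```python
from typing import Dict, Tuple, Union

# A rule is either a terminal character or a pair (left symbol, right symbol).
Rule = Union[str, Tuple[str, str]]


def expansion_lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Length of the string each symbol expands to, computed with memoization."""
    lengths: Dict[str, int] = {}

    def length(sym: str) -> int:
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules: Dict[str, Rule], lengths: Dict[str, int],
           start: str, i: int) -> str:
    """Return character i of the expansion of `start` without decompressing."""
    if not 0 <= i < lengths[start]:
        raise IndexError(i)
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left              # target lies in the left child
        else:
            i -= lengths[left]      # skip the left child's expansion
            sym = right


if __name__ == "__main__":
    # Fibonacci-style grammar: X6 expands to "abaababa".
    rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
             "X5": ("X4", "X3"), "X6": ("X5", "X4")}
    lengths = expansion_lengths(rules)
    print("".join(access(rules, lengths, "X6", i) for i in range(lengths["X6"])))
```

For an unbalanced grammar the height can be as large as n, which is why a guaranteed O(log N) access time requires the additional structures the abstract lists.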

Patent•

System and method for productive generation of compound words in statistical machine translation

Nicola Cancedda¹, Sara Stymne¹•Institutions (1)

Xerox¹

25 Jul 2011

TL;DR: In this article, a method and a system for making merging decisions for a translation are disclosed which are suited to use where the target language is a productive compounding one, including outputting decisions on merging of pairs of words in a translated text string with a merging system.

Abstract: A method and a system for making merging decisions for a translation are disclosed which are suited to use where the target language is a productive compounding one. The method includes outputting decisions on merging of pairs of words in a translated text string with a merging system. The merging system can include a set of stored heuristics and/or a merging model. In the case of heuristics, these can include a heuristic by which two consecutive words in the string are considered for merging if the first word of the two consecutive words is recognized as a compound modifier and their observed frequency f₁ as a closed compound word is larger than an observed frequency f₂ of the two consecutive words as a bigram. In the case of a merging model, it can be one that is trained on features associated with pairs of consecutive tokens of text strings in a training set and predetermined merging decisions for the pairs. A translation in the target language is output, based on the merging decisions for the translated text string.
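The frequency heuristic in this abstract (merge when the first word is a compound modifier and the closed-compound frequency f₁ exceeds the bigram frequency f₂) can be sketched in a few lines. The data structures, the plain concatenation of the two words, the left-to-right scan, and the names `should_merge` and `merge_compounds` are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Dict, List, Set, Tuple


def should_merge(first: str, second: str,
                 modifiers: Set[str],
                 compound_freq: Dict[str, int],
                 bigram_freq: Dict[Tuple[str, str], int]) -> bool:
    """Apply the heuristic: modifier check, then f1 > f2."""
    if first not in modifiers:
        return False
    f1 = compound_freq.get(first + second, 0)   # seen as one closed compound
    f2 = bigram_freq.get((first, second), 0)    # seen as two separate words
    return f1 > f2


def merge_compounds(tokens: List[str],
                    modifiers: Set[str],
                    compound_freq: Dict[str, int],
                    bigram_freq: Dict[Tuple[str, str], int]) -> List[str]:
    """Scan left to right, merging adjacent pairs accepted by the heuristic."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and should_merge(
                tokens[i], tokens[i + 1], modifiers, compound_freq, bigram_freq):
            out.append(tokens[i] + tokens[i + 1])  # assumed: plain concatenation
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(merge_compounds(
        ["die", "haus", "tür", "ist", "offen"],
        modifiers={"haus"},
        compound_freq={"haustür": 120},
        bigram_freq={("haus", "tür"): 3},
    ))
```

This sketch does not chain merges into compounds of three or more parts and ignores linking elements between parts; both would need extra handling in a real compounding language.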

Patent•

Fast identification of complex strings in a data stream

Kevin Gerard Boyce

07 Jun 2011

TL;DR: In this article, a method for detecting and locating occurrence in a data stream of any complex string belonging to a predefined complex dictionary is disclosed, where a complex string may comprise an arbitrary number of interleaving coherent strings and ambiguous strings.

Abstract: A method for detecting and locating occurrence in a data stream of any complex string belonging to a predefined complex dictionary is disclosed. A complex string may comprise an arbitrary number of interleaving coherent strings and ambiguous strings. The method comprises a first process for transforming the complex dictionary into a simple structure to enable continuously conducting computationally efficient search, and a second process for examining received data in real time using the simple structure. The method may be implemented as an article of manufacture having a processor-readable storage medium having instructions stored thereon for execution by a processor, causing the processor to match examined data to an object complex string belonging to the complex dictionary, where the matching process is based on equality to constituent coherent strings, and congruence to ambiguous strings, of the object complex string.

Journal Article•DOI•

A Novel Error-Tolerant Anonymous Linking Code

Rainer Schnell¹, Tobias Bachteler¹, Jörg Reiher¹•Institutions (1)

University of Duisburg-Essen¹

16 Nov 2011-Social Science Research Network

TL;DR: It is claimed that the use of Bloom filters for calculating string similarities in a privacy-preserving manner can also be used for a novel error-tolerant but still irreversible encrypted key.

Abstract: An anonymous linking code is an encrypted key for linking data from different sources. So far, quite simple algorithms for the generation of such codes based on personal characteristics such as names and date of birth are in common use. These algorithms will yield many non-matching codes when facing errors in the underlying identifier values. We suggested the use of Bloom filters for calculating string similarities in a privacy-preserving manner. Here, we claim that this principle can also be used for a novel error-tolerant but still irreversible encrypted key. We call the proposed code Cryptographic Longterm Key. It consists of one single Bloom filter into which identifiers are subsequently stored. Tests on simulated databases yield linkage results comparable to non-encrypted identifiers and superior to results from hitherto existing methods. Since the Cryptographic Longterm Key can be easily adapted to meet quite different prerequisites, it might be useful for many applications.
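The principle of storing several identifiers in one Bloom filter and comparing filters by similarity can be sketched as below. This is not the authors' parameterization: the filter size of 1000 bits, 10 hash functions, double hashing built on SHA-256, padded bigrams, per-field salting and the Dice coefficient are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib
from typing import Iterable, Set


def bigrams(value: str) -> Set[str]:
    """Character bigrams of the lower-cased value, padded with one space."""
    padded = f" {value.lower()} "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bit_positions(token: str, size: int, num_hashes: int, salt: str) -> Set[int]:
    """Double hashing: position_i = (h1 + i * h2) mod size."""
    h1 = int.from_bytes(hashlib.sha256(f"1|{salt}|{token}".encode()).digest(), "big")
    h2 = int.from_bytes(hashlib.sha256(f"2|{salt}|{token}".encode()).digest(), "big")
    return {(h1 + i * h2) % size for i in range(num_hashes)}


def encode_record(identifiers: Iterable[str], size: int = 1000,
                  num_hashes: int = 10, secret: str = "") -> Set[int]:
    """Store the bigrams of every identifier in one filter (set of set bits)."""
    bits: Set[int] = set()
    for field_index, value in enumerate(identifiers):
        salt = f"{secret}|{field_index}"  # assumed: keeps fields distinguishable
        for gram in bigrams(value):
            bits |= bit_positions(gram, size, num_hashes, salt)
    return bits


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice coefficient of two filters given as sets of set-bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


if __name__ == "__main__":
    original = encode_record(["smith", "john", "1970-01-01"], secret="k")
    typo = encode_record(["smyth", "john", "1970-01-01"], secret="k")
    other = encode_record(["mueller", "anna", "1985-12-24"], secret="k")
    print(dice(original, typo), dice(original, other))
```

A single misspelled letter changes only a few bigrams, so most set bits survive and the similarity stays high, whereas an exact-match code would differ completely; that is the error tolerance the abstract refers to.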

Proceedings Article•DOI•

An evaluation of automata algorithms for string analysis

Pieter Hooimeijer¹, Margus Veanes²•Institutions (2)

University of Virginia¹, Microsoft²

23 Jan 2011

TL;DR: A comprehensive set of algorithms and data structures for performing fast automata operations for string constraint solving is studied to provide an apples-to-apples comparison between techniques that are used in current tools.

...read moreread less

Abstract: There has been significant recent interest in automated reasoning techniques, in particular constraint solvers, for string variables. These techniques support a wide variety of clients, ranging from static analysis to automated testing. The majority of string constraint solvers rely on finite automata to support regular expression constraints. For these approaches, performance depends critically on fast automata operations such as intersection, complementation, and determinization. Existing work in this area has not yet provided conclusive results as to which core algorithms and data structures work best in practice.In this paper, we study a comprehensive set of algorithms and data structures for performing fast automata operations. Our goal is to provide an apples-to-apples comparison between techniques that are used in current tools. To achieve this, we re-implemented a number of existing techniques. We use an established set of regular expressions benchmarks as an indicative workload. We also include several techniques that, to the best of our knowledge, have not yet been used for string constraint solving. Our results show that there is a substantial performance difference across techniques, which has implications for future tool design.

Proceedings Article•

A novel dependency-to-string model for statistical machine translation

Jun Xie, Haitao Mi, Qun Liu

27 Jul 2011

TL;DR: A source dependency structure based model that requires no heuristics or separate ordering models of the previous works to control the word order of translations and performs well on long distance reordering.

Abstract: Dependency structure, as a first step towards semantics, is believed to be helpful to improve translation quality. However, previous works on dependency structure based models typically resort to insertion operations to complete translations, which make it difficult to specify ordering information in translation rules. In our model of this paper, we handle this problem by directly specifying the ordering information in head-dependents rules which represent the source side as head-dependents relations and the target side as strings. The head-dependents rules require only substitution operation, thus our model requires no heuristics or separate ordering models of the previous works to control the word order of translations. Large-scale experiments show that our model performs well on long distance reordering, and outperforms the state-of-the-art constituency-to-string model (+1.47 BLEU on average) and hierarchical phrase-based model (+0.46 BLEU on average) on two Chinese-English NIST test sets without resort to phrases or parse forest. For the first time, a source dependency structure based model catches up with and surpasses the state-of-the-art translation models.

Patent•

Methods of Performing Error Detection/Correction in Nonvolatile Memory Devices

Yongjune Kim¹, Junjin Kong¹, Kyoung Lae Cho¹•Institutions (1)

Samsung¹

21 Jan 2011

TL;DR: In this article, weak column information is used to facilitate error detection and correction operations on a first plurality of bits of data read from the plurality of strings using an algorithm that modifies a weighting of the reliability of one or more data bits in the first plurality.

Abstract: Methods of operating nonvolatile memory devices include testing a plurality of strings of nonvolatile memory cells in the memory device to identify at least one weak string therein having a higher probability of yielding erroneous read data error relative to other ones of the plurality of strings. An identity of the at least one weak string may be stored as weak column information. This weak column information may be used to facilitate error detection and correction operations. In particular, an error correction operation may be performed on a first plurality of bits of data read from the plurality of strings using an algorithm that modifies a weighting of the reliability of one or more data bits in the first plurality of bits of data based on the weak column information. More specifically, an algorithm may be used that interprets a bit of data read from the at least one weak string as having a relatively reduced reliability relative to other ones of the first plurality of data bits.

Patent•

Severing of downhole tubing with associated cable

Michael C. Robertson, William F. Boelte

16 Sep 2011

TL;DR: In this article, the authors propose a method for severing a tubular string having a cable in association therewith, which can be performed through a single actuation of a single cutting apparatus, enabling at least a portion of the tube to be subsequently severed and retrieved.

Abstract: Methods for severing a tubular string having a cable in association therewith can include lowering a cutting apparatus into the tubular string and actuating the cutting apparatus to form a cut in the tubular string and sever the cable. Severing the cable in this manner can be performed through a single actuation of a single cutting apparatus, enabling at least a portion of the tubular string to be subsequently severed and retrieved, unimpeded by the cable.

Book Chapter•DOI•

Lightweight BWT construction for very large string collections

Markus J. Bauer¹, Anthony J. Cox¹, Giovanna Rosone²•Institutions (2)

Illumina¹, University of Palermo²

27 Jun 2011

TL;DR: The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m, and apply to any string collection over any alphabet.

Abstract: A modern DNA sequencing machine can generate a billion or more sequence fragments in a matter of days. The many uses of the BWT in compression and indexing are well known, but the computational demands of creating the BWT of datasets this large have prevented its applications from being widely explored in this context. We address this obstacle by presenting two algorithms capable of computing the BWT of very large string collections. The algorithms are lightweight in that the first needs O(m log m) bits of memory to process m strings and the memory requirements of the second are constant with respect to m. We evaluate our algorithms on collections of up to 1 billion strings and compare their performance to other approaches on smaller datasets. Although our tests were on collections of DNA sequences of uniform length, the algorithms themselves apply to any string collection over any alphabet.
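
To make the object being computed concrete, here is a naive reference sketch of the BWT of a string collection. It sorts every suffix explicitly, so its memory use grows far faster than that of the lightweight algorithms described above; it only illustrates the definition. The convention that each string gets its own end marker, ordered by position in the collection and sorting before all real symbols, is an assumption of this example.

```python
def collection_bwt(strings):
    """Naive BWT of a collection: sort all suffixes, emit the preceding symbols."""
    suffixes = []
    for index, text in enumerate(strings):
        # Encode symbols so that every end marker (0, index) sorts before every
        # real character (1, ch), and marker i sorts before marker j when i < j.
        encoded = [(1, ch) for ch in text] + [(0, index)]
        for pos in range(len(encoded)):
            # The symbol preceding the whole-string suffix is the end marker.
            preceding = text[pos - 1] if pos > 0 else "$"
            suffixes.append((encoded[pos:], preceding))
    suffixes.sort(key=lambda item: item[0])
    return "".join(preceding for _, preceding in suffixes)


if __name__ == "__main__":
    print(collection_bwt(["banana"]))    # single string
    print(collection_bwt(["ab", "aa"]))  # small collection
```

All end markers are printed as the same `$` character in the output, so the result has one `$` per input string.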

Proceedings Article•DOI•

Learning bilingual lexicons using the visual similarity of labeled web images

Shane Bergsma¹, Benjamin Van Durme¹•Institutions (1)

Johns Hopkins University¹

16 Jul 2011

TL;DR: This work generates bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects, and uses these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations.

Abstract: Speakers of many different languages use the Internet. A common activity among these users is uploading images and associating these images with words (in their own language) as captions, filenames, or surrounding text. We use these explicit, monolingual, image-to-word connections to successfully learn implicit, bilingual, word-to-word translations. Bilingual pairs of words are proposed as translations if their corresponding images have similar visual features. We generate bilingual lexicons in 15 language pairs, focusing on words that have been automatically identified as physical objects. The use of visual similarity substantially improves performance over standard approaches based on string similarity: for generated lexicons with 1000 translations, including visual information leads to an absolute improvement in accuracy of 8-12% over string edit distance alone.
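
The string edit distance baseline mentioned above can be illustrated with the standard Levenshtein dynamic program. The abstract does not say which variant or normalisation the authors used, so the length-normalised `similarity` helper below is purely an assumption for illustration.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def similarity(source: str, target: str) -> float:
    """Hypothetical normalisation: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(source, target) / longest


if __name__ == "__main__":
    # Cognates score high; unrelated translations of the same object do not.
    print(similarity("tomato", "tomate"), similarity("dog", "perro"))
```

A score of this kind favours cognates and fails for translation pairs that share no spelling, which is the gap that visual similarity is meant to fill.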

Patent•

Secure and automated credential information transfer mechanism

Sean Stauth¹, Sewook Wee¹•Institutions (1)

Accenture¹

04 Feb 2011

TL;DR: In this paper, a mechanism for securely transmitting credentials to instantiated virtual machines is provided, where a central server is used to turn on a virtual machine and send it a secret text string.

Abstract: A mechanism for securely transmitting credentials to instantiated virtual machines is provided. A central server is used to turn on a virtual machine. When the virtual machine is turned on, the central server sends it a secret text string. The virtual machine requests the credentials from the central server by transmitting the secret string and its instance ID. The central server validates the secret string and source IP to determine whether they are authentic. Once verified, the central server transmits the credentials to the virtual machine in a secure channel and invalidates the secret string. The credentials can now be used to authenticate API calls.
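
The exchange described above can be sketched as a small in-memory model. This is not the patented implementation: the class name `CentralServer`, its method names, the use of a random hex token as the secret, and the choice to leave the secret valid after a failed attempt are all assumptions made for illustration, and the secure channel is not modelled.

```python
import hmac
import secrets


class CentralServer:
    """Toy model of the central server (hypothetical API, no real networking)."""

    def __init__(self, credentials: str):
        self._credentials = credentials
        self._pending = {}  # instance_id -> (secret, expected source IP)

    def launch(self, instance_id: str, ip: str) -> str:
        """Turn on a virtual machine and hand it a fresh secret text string."""
        secret = secrets.token_hex(16)
        self._pending[instance_id] = (secret, ip)
        return secret

    def request_credentials(self, instance_id: str, secret: str, source_ip: str):
        """Validate the secret string and source IP; release credentials once."""
        entry = self._pending.get(instance_id)
        if entry is None:
            return None
        expected_secret, expected_ip = entry
        secret_ok = hmac.compare_digest(secret.encode(), expected_secret.encode())
        if not secret_ok or source_ip != expected_ip:
            return None
        # Invalidate the secret so it cannot be replayed.
        del self._pending[instance_id]
        return self._credentials


if __name__ == "__main__":
    server = CentralServer("api-key-123")
    token = server.launch("i-001", "10.0.0.5")
    print(server.request_credentials("i-001", token, "10.0.0.5"))  # credentials
    print(server.request_credentials("i-001", token, "10.0.0.5"))  # None: already used
```

Invalidating the secret after the first successful request means that a leaked secret string is of no use once the legitimate virtual machine has fetched its credentials.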

Journal Article•DOI•

Sparse RNA folding: Time and space efficient algorithms

Rolf Backofen¹, Dekel Tsur², Shay Zakov², Michal Ziv-Ukelson²•Institutions (2)

University of Freiburg¹, Ben-Gurion University of the Negev²

01 Mar 2011-Journal of Discrete Algorithms

TL;DR: The currently fastest algorithm for RNA Single Strand Folding requires O(nZ) time and Θ(n^2) space, where n denotes the length of the input string and Z is a sparsity parameter satisfying n ≤ Z.

Posted Content•

A Context-theoretic Framework for Compositionality in Distributional Semantics

Daoud Clarke¹•Institutions (1)

University of Hertfordshire¹

24 Jan 2011-arXiv: Computation and Language

TL;DR: Approaches to the task of recognizing textual entailment, including the use of subsequence matching, lexical entailment probability, and latent Dirichlet allocation, can be described within this framework.

Abstract: Techniques in which words are represented as vectors have proved useful in many applications in computational linguistics, however there is currently no general semantic formalism for representing meaning in terms of vectors. We present a framework for natural language semantics in which words, phrases and sentences are all represented as vectors, based on a theoretical analysis which assumes that meaning is determined by context. In the theoretical analysis, we define a corpus model as a mathematical abstraction of a text corpus. The meaning of a string of words is assumed to be a vector representing the contexts in which it occurs in the corpus model. Based on this assumption, we can show that the vector representations of words can be considered as elements of an algebra over a field. We note that in applications of vector spaces to representing meanings of words there is an underlying lattice structure; we interpret the partial ordering of the lattice as describing entailment between meanings. We also define the context-theoretic probability of a string, and, based on this and the lattice structure, a degree of entailment between strings. We relate the framework to existing methods of composing vector-based representations of meaning, and show that our approach generalises many of these, including vector addition, component-wise multiplication, and the tensor product.
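
The three composition operations named at the end of the abstract can be shown on toy vectors. This sketch illustrates only the operations themselves, not the context-theoretic framework; the two context-count vectors are invented for the example.

```python
def vector_add(u, v):
    """Compose two meanings by adding their context vectors."""
    return [a + b for a, b in zip(u, v)]


def componentwise_multiply(u, v):
    """Compose by multiplying matching components (keeps shared contexts)."""
    return [a * b for a, b in zip(u, v)]


def tensor_product(u, v):
    """Compose into a matrix with one entry per pair of components."""
    return [[a * b for b in v] for a in u]


if __name__ == "__main__":
    # Hypothetical counts of how often each word occurs in three contexts.
    red = [2, 0, 1]
    car = [1, 3, 1]
    print(vector_add(red, car))
    print(componentwise_multiply(red, car))
    print(tensor_product(red, car))
```

Addition and component-wise multiplication keep the dimensionality of the inputs, while the tensor product grows it, which is why it can encode word order at the cost of much larger representations.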

Proceedings Article•DOI•

String-based audiovisual fusion of behavioural events for the assessment of dimensional affect

Florian Eyben¹, Martin Wöllmer¹, Michel Valstar², Hatice Gunes², Björn Schuller¹, Maja Pantic²•Institutions (2)

Technische Universität München¹, Imperial College London²

21 Mar 2011

TL;DR: The experimental results show that the proposed string-based approach is the best performing approach for automatic prediction of Valence and Expectation dimensions, and improves prediction performance for the other dimensions when combined with at least acoustic signal-based features.

Abstract: The automatic assessment of affect is mostly based on feature-level approaches, such as distances between facial points or prosodic and spectral information when it comes to audiovisual analysis. However, it is known and intuitive that behavioural events such as smiles, head shakes or laughter and sighs also bear highly relevant information regarding a subject's affective display. Accordingly, we propose a novel string-based prediction approach to fuse such events and to predict human affect in a continuous dimensional space. Extensive analysis and evaluation has been conducted using the newly released SEMAINE database of human-to-agent communication. For a thorough understanding of the obtained results, we provide additional benchmarks by more conventional feature-level modelling, and compare these and the string-based approach to fusion of signal-based features and string-based events. Our experimental results show that the proposed string-based approach is the best performing approach for automatic prediction of Valence and Expectation dimensions, and improves prediction performance for the other dimensions when combined with at least acoustic signal-based features.

Book Chapter•DOI•

Non-interactive and re-usable universally composable string commitments with adaptive security

Marc Fischlin¹, Benoît Libert², Mark Manulis¹•Institutions (2)

Technische Universität Darmstadt¹, Université catholique de Louvain²

04 Dec 2011

TL;DR: These are the first provably secure constructions of universally composable (UC) commitments (in pairing-friendly groups) that simultaneously combine the key properties of being non-interactive, supporting commitments to strings, and offering re-usability of the common reference string for multiple commitments.

Abstract: We present the first provably secure constructions of universally composable (UC) commitments (in pairing-friendly groups) that simultaneously combine the key properties of being non-interactive, supporting commitments to strings (instead of bits only), and offering re-usability of the common reference string for multiple commitments. Our schemes are also adaptively secure assuming reliable erasures.
