Showing papers on "String (computer science)" published in 2012


Journal ArticleDOI
TL;DR: The update to version 9.1 of STRING is described, introducing several improvements, including extending the automated mining of scientific texts for interaction information, to now also include full-text articles, and providing users with statistical information on any functional enrichment observed in their networks.
Abstract: Complete knowledge of all direct and indirect interactions between proteins in a given cell would represent an important milestone towards a comprehensive description of cellular mechanisms and functions. Although this goal is still elusive, considerable progress has been made, particularly for certain model organisms and functional systems. Currently, protein interactions and associations are annotated at various levels of detail in online resources, ranging from raw data repositories to highly formalized pathway databases. For many applications, a global view of all the available interaction data is desirable, including lower-quality data and/or computational predictions. The STRING database (http://string-db.org/) aims to provide such a global perspective for as many organisms as feasible. Known and predicted associations are scored and integrated, resulting in comprehensive protein networks covering >1100 organisms. Here, we describe the update to version 9.1 of STRING, introducing several improvements: (i) we extend the automated mining of scientific texts for interaction information, to now also include full-text articles; (ii) we entirely re-designed the algorithm for transferring interactions from one model organism to the other; and (iii) we provide users with statistical information on any functional enrichment observed in their networks.

3,900 citations


Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of distributed control of a platoon of vehicles with nonlinear dynamics and derive sufficient conditions that guarantee asymptotic stability and string stability.
Abstract: This paper considers the problem of distributed control of a platoon of vehicles with nonlinear dynamics. We present distributed receding horizon control algorithms and derive sufficient conditions that guarantee asymptotic stability, leader-follower string stability, and predecessor-follower string stability, following a step speed change in the platoon. Vehicles compute their own control in parallel, and receive communicated position and velocity error trajectories from their immediate predecessor. Leader-follower string stability requires additional communication from the lead car at each update, in the form of a position error trajectory. Predecessor-follower string stability, as we define it, implies leader-follower string stability. Predecessor-follower string stability requires stricter constraints in the local optimal control problems than the leader-follower formulation, but communication from the lead car is required only once at initialization. Provided an initially feasible solution can be found, subsequent feasibility of the algorithms is guaranteed at every update. The theory is generalized for nonlinear decoupled dynamics, and is thus applicable to fleets of planes, robots, or boats, in addition to cars. A simple seven-car simulation examines parametric tradeoffs that affect stability and string stability. Analysis on platoon formation, heterogeneity and size (length) is also considered, resulting in intuitive tradeoffs between lead car and following car control flexibility.

357 citations


Patent
14 Aug 2012
TL;DR: In this article, a semiconductor memory device is provided including first and second cell strings formed on a substrate, the first and second cell strings jointly connected to a bit line, wherein the second string selection unit of the first cell string has a channel dopant region.
Abstract: A semiconductor memory device is provided including first and second cell strings formed on a substrate, the first and second cell strings jointly connected to a bit line, wherein each of the first and second cell strings includes a ground selection unit, a memory cell, and first and second string selection units sequentially formed on the substrate to be connected to each other, wherein the ground selection unit is connected to a ground selection line, the memory cell is connected to a word line, the first string selection unit is connected to a first string selection line, and the second string selection unit is connected to a second string selection line, and wherein the second string selection unit of the first cell string has a channel dopant region.

330 citations


Proceedings Article
12 Jul 2012
TL;DR: This paper proposes a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution and shows that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset.
Abstract: Microblog normalisation methods often utilise complex models and struggle to differentiate between correctly-spelled unknown words and lexical variants of known words. In this paper, we propose a method for constructing a dictionary of lexical variants of known words that facilitates lexical normalisation via simple string substitution (e.g. tomorrow for tmrw). We use context information to generate possible variant and normalisation pairs and then rank these by string similarity. Highly-ranked pairs are selected to populate the dictionary. We show that a dictionary-based approach achieves state-of-the-art performance for both F-score and word error rate on a standard dataset. Compared with other methods, this approach offers a fast, lightweight and easy-to-use solution, and is thus suitable for high-volume microblog pre-processing.

203 citations
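
The substitution step described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the dictionary contents (apart from the tmrw/tomorrow pair quoted in the abstract), the name `NORMALISATION_DICT`, and the word-level tokenisation are all assumptions made for the example.

```python
import re

# Hypothetical variant -> canonical-form dictionary. Only "tmrw" -> "tomorrow"
# comes from the abstract; the other entries are made up for illustration.
NORMALISATION_DICT = {"tmrw": "tomorrow", "2moro": "tomorrow", "u": "you"}

def normalise(text, dictionary=NORMALISATION_DICT):
    """Replace every token found in the dictionary; leave all other text as-is."""
    def replace(match):
        token = match.group(0)
        # Look tokens up case-insensitively; unknown tokens pass through unchanged.
        return dictionary.get(token.lower(), token)
    return re.sub(r"\w+", replace, text)
```

Because normalisation is a single dictionary lookup per token, the cost is linear in the length of the text, which is in line with the "fast, lightweight" claim in the abstract.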


Journal ArticleDOI
TL;DR: The string stability of CACC is discussed and its performance under varying packet loss ratios, beacon sending frequencies, and time headway settings in simulation experiments is evaluated.
Abstract: Recent development in wireless technology enables communication between vehicles. The concept of cooperative adaptive cruise control (CACC)–which uses wireless communication between vehicles–aims at string stable behavior in a platoon of vehicles. “String stability” means any non-zero position, speed, and acceleration errors of an individual vehicle in a string do not amplify when they propagate upstream. In this article, we will discuss the string stability of CACC and evaluate its performance under varying packet loss ratios, beacon sending frequencies, and time headway settings in simulation experiments. The simulation framework is built up with a controller prototype, a traffic simulator, and a network simulator.

189 citations


Book ChapterDOI
19 Aug 2012
TL;DR: In this paper, the authors propose a functional encryption system that supports functionality for regular languages, in which a secret key is associated with a deterministic finite automaton (DFA) M and a ciphertext encrypts a message and is associated with an arbitrary-length string w. A user is able to decrypt the ciphertext if and only if the DFA M associated with his private key accepts the string w.
Abstract: We provide a functional encryption system that supports functionality for regular languages. In our system a secret key is associated with a Deterministic Finite Automata (DFA) M. A ciphertext \(\text {CT}\) encrypts a message m and is associated with an arbitrary length string w. A user is able to decrypt the ciphertext \(\text {CT}\) if and only if the DFA M associated with his private key accepts the string w.

166 citations
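
The decryption condition in the abstract above is the ordinary DFA acceptance test. The sketch below shows only that predicate, with no cryptography; the function name, the transition-table encoding, and the example automaton are assumptions chosen for illustration.

```python
def dfa_accepts(transitions, start, accepting, w):
    """Return True if the DFA accepts w.

    transitions maps (state, symbol) -> next state; a missing entry is
    treated as an implicit rejecting dead state.
    """
    state = start
    for symbol in w:
        key = (state, symbol)
        if key not in transitions:
            return False
        state = transitions[key]
    return state in accepting

# Hypothetical DFA M over the alphabet {a, b} that accepts strings ending in "ab".
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
START, ACCEPTING = "q0", {"q2"}
```

In the scheme described, a key holder for M could decrypt a ciphertext tagged with w exactly when `dfa_accepts(TRANSITIONS, START, ACCEPTING, w)` would return True for that M and w.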


Journal ArticleDOI
TL;DR: In this article, robust adaptive boundary control is developed for a class of flexible string-type systems under unknown time-varying disturbance, where the dynamics of the string system is represented by a nonhomogeneous hyperbolic partial differential equation (PDE) and two ordinary differential equations.
Abstract: In this paper, robust adaptive boundary control is developed for a class of flexible string-type systems under unknown time-varying disturbance. The dynamics of the string system is represented by a nonhomogeneous hyperbolic partial differential equation (PDE) and two ordinary differential equations. Boundary control is proposed at the right boundary of the string based on the original distributed parameter system model (PDE) to suppress the vibration excited by the external unknown disturbance. Adaptive control is designed to compensate the system parametric uncertainty. With the proposed robust adaptive boundary control, all the signals in the closed-loop system are guaranteed to be uniformly ultimately bounded. The state of the string system is proven to converge to a small neighborhood of zero by appropriately choosing design parameters. Simulations are provided to illustrate the effectiveness of the proposed control.

151 citations


Journal ArticleDOI
TL;DR: The proposed framework of text localization is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons and demonstrates that the framework outperforms state-of-the-art localization algorithms.
Abstract: In this paper, we propose a novel framework to extract text regions from scene images with complex backgrounds and multiple text appearances. This framework consists of three main steps: boundary clustering (BC), stroke segmentation, and string fragment classification. In BC, we propose a new bigram-color-uniformity-based method to model both text and attachment surface, and cluster edge pixels based on color pairs and spatial positions into boundary layers. Then, stroke segmentation is performed at each boundary layer by color assignment to extract character candidates. We propose two algorithms to combine the structural analysis of text stroke with color assignment and filter out background interferences. Further, we design a robust string fragment classification based on Gabor-based text features. The features are obtained from feature maps of gradient, stroke distribution, and stroke width. The proposed framework of text localization is evaluated on scene images, born-digital images, broadcast video images, and images of handheld objects captured by blind persons. Experimental results on respective datasets demonstrate that the framework outperforms state-of-the-art localization algorithms.

135 citations


Posted Content
TL;DR: This article proposed a discriminative string-edit CRF, a conditional random field model for edit sequences between strings, which is trained on both positive and negative instances of string pairs.
Abstract: The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

133 citations


Patent
07 Dec 2012
TL;DR: In this article, a machine translation method includes receiving a source text string and identifying any named entities. The identified named entities may be processed to exclude common nouns and function words. Based on features extracted from the source text string, a protocol is selected for translating the source text string.
Abstract: A machine translation method includes receiving a source text string and identifying any named entities. The identified named entities may be processed to exclude common nouns and function words. Features are extracted from the source text string relating to the identified named entities. Based on the extracted features, a protocol is selected for translating the source text string. A first translation protocol includes forming a reduced source string from the source text string in which the named entity is replaced by a placeholder, translating the reduced source string by machine translation to generate a translated reduced target string, while processing the named entity separately to be incorporated into the translated reduced target string. A second translation protocol includes translating the source text string by machine translation, without replacing the named entity with the placeholder. The target text string produced by the selected protocol is output.

121 citations


Book ChapterDOI
07 Jul 2012
TL;DR: A framework that can learn number transformations from very few input-output examples is presented, and an inductive synthesis algorithm for manipulating data types that have numbers as a constituent sub-type such as date, unit, and time is obtained.
Abstract: Numbers are one of the most widely used data types in programming languages. Number transformations like formatting and rounding present a challenge even for experienced programmers as they find it difficult to remember different number format strings supported by different programming languages. These transformations present an even bigger challenge for end-users of spreadsheet systems like Microsoft Excel where providing such custom format strings is beyond their expertise. In our extensive case study of help forums of many programming languages and Excel, we found that both programmers and end-users struggle with these number transformations, but are able to easily express their intent using input-output examples. In this paper, we present a framework that can learn such number transformations from very few input-output examples. We first describe an expressive number transformation language that can model these transformations, and then present an inductive synthesis algorithm that can learn all expressions in this language that are consistent with a given set of examples. We also present a ranking scheme of these expressions that enables efficient learning of the desired transformation from very few examples. By combining our inductive synthesis algorithm for number transformations with an inductive synthesis algorithm for syntactic string transformations, we are able to obtain an inductive synthesis algorithm for manipulating data types that have numbers as a constituent sub-type such as date, unit, and time. We have implemented our algorithms as an Excel add-in and have evaluated it successfully over several benchmarks obtained from the help forums and the Excel product team.

Proceedings ArticleDOI
02 Jun 2012
TL;DR: It is observed that malformed HTML is often produced by incorrect constant prints, i.e., statements that print string literals, and two tools for automatically repairing such HTML generation errors are presented.
Abstract: PHP web applications routinely generate invalid HTML. Modern browsers silently correct HTML errors, but sometimes malformed pages render inconsistently, cause browser crashes, or expose security vulnerabilities. Fixing errors in generated pages is usually straightforward, but repairing the generating PHP program can be much harder. We observe that malformed HTML is often produced by incorrect "constant prints", i.e., statements that print string literals, and present two tools for automatically repairing such HTML generation errors. PHPQuickFix repairs simple bugs by statically analyzing individual prints. PHPRepair handles more general repairs using a dynamic approach. Based on a test suite, the property that all tests should produce their expected output is encoded as a string constraint over variables representing constant prints. Solving this constraint describes how constant prints must be modified to make all tests pass. Both tools were implemented as an Eclipse plugin and evaluated on PHP programs containing hundreds of HTML generation errors, most of which our tools were able to repair automatically.

Posted Content
TL;DR: An expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations is described and a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples is presented.
Abstract: We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are based on regular expressions and which interpret a string as a sequence of characters, semantic transformations additionally require exploiting the semantics of the data type represented by the string, which may be encoded as a database of relational tables. Manually performing such transformations on a large collection of strings is error prone and cumbersome, while programmatic solutions are beyond the skill-set of end-users. We present a programming by example technology that allows end-users to automate such repetitive tasks. We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help-forums.

Proceedings ArticleDOI
01 Dec 2012
TL;DR: Conditions on the uncertain sampling intervals and delays under which string stability can still be guaranteed are provided to support the design of CACC systems that are robust to uncertainties introduced by wireless communication.
Abstract: In this paper, we present a novel modelling and string stability analysis method for an interconnected vehicle string in which information exchange takes place via wireless communication. The usage of wireless communication introduces time-varying sampling intervals, delays, and communication constraints of which the impact on string stability requires a careful analysis. In particular, we study a Cooperative Adaptive Cruise Control (CACC) system which regulates inter-vehicle distances in a vehicle string and utilizes information exchange between vehicles through wireless communication in addition to local sensor measurements. The propagation of disturbances through the interconnected vehicle string is inspected by using the notion of so-called string stability which is formulated here in terms of an \(\mathcal{L}_2\)-gain requirement from disturbance inputs to controlled outputs. This paper provides conditions on the uncertain sampling intervals and delays under which string stability can still be guaranteed. These results support the design of CACC systems that are robust to uncertainties introduced by wireless communication.

Journal ArticleDOI
01 Apr 2012
TL;DR: In this article, the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc., is addressed.
Abstract: We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are based on regular expressions and which interpret a string as a sequence of characters, semantic transformations additionally require exploiting the semantics of the data type represented by the string, which may be encoded as a database of relational tables. Manually performing such transformations on a large collection of strings is error prone and cumbersome, while programmatic solutions are beyond the skill-set of end-users. We present a programming by example technology that allows end-users to automate such repetitive tasks. We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help-forums.

Journal ArticleDOI
TL;DR: A simple diagnostic method to determine the number of open and short circuited PV modules in a string of a PV system by taking into account the economical factor, such as minimum number of sensors, has been proposed.

Journal ArticleDOI
01 Mar 2012
TL;DR: The obtained results indicate that prioritisation based on string distances is more efficient in finding defects than random ordering of the test suite: the test suites prioritized using string distances are more efficient in detecting the strongest mutants, and, on average, have a better APFD than randomly ordered test suites.
Abstract: Test case prioritisation aims at finding an ordering which enhances a certain property of an ordered test suite. Traditional techniques rely on the availability of code or a specification of the program under test. We propose to use string distances on the text of test cases for their comparison and elaborate a prioritisation algorithm. Such a prioritisation does not require code or a specification and can be useful for initial testing and in cases when code is difficult to instrument. In this paper, we also report on experiments performed on the "Siemens Test Suite", where the proposed prioritisation technique was compared with random permutations and four classical string distance metrics were evaluated. The obtained results, confirmed by a statistical analysis, indicate that prioritisation based on string distances is more efficient in finding defects than random ordering of the test suite: the test suites prioritized using string distances are more efficient in detecting the strongest mutants, and, on average, have a better APFD than randomly ordered test suites. The results suggest that string distances can be used for prioritisation purposes, and Manhattan distance could be the best choice.
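
The idea of ordering test cases by the distance between their texts can be sketched as follows. This is not the paper's algorithm: the encoding of each test as a vector of character counts, the farthest-first greedy ordering, and the names `manhattan` and `prioritise` are all assumptions chosen to give one concrete, runnable instance of distance-based prioritisation.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan distance between two strings viewed as character-count vectors
    (an assumed encoding, used here only for illustration)."""
    ca, cb = Counter(a), Counter(b)
    return sum(abs(ca[ch] - cb[ch]) for ch in set(ca) | set(cb))

def prioritise(tests):
    """Greedy farthest-first ordering: each step picks the test whose nearest
    already-selected neighbour is as far away as possible."""
    if not tests:
        return []
    remaining = list(tests)
    # Seed with the test that is farthest, in total, from all the others.
    first = max(remaining, key=lambda t: sum(manhattan(t, o) for o in remaining))
    ordered = [first]
    remaining.remove(first)
    while remaining:
        nxt = max(remaining, key=lambda t: min(manhattan(t, o) for o in ordered))
        ordered.append(nxt)
        remaining.remove(nxt)
    return ordered
```

The ordering needs only the text of the test cases, which matches the abstract's point that neither code nor a specification is required.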

Patent
16 Mar 2012
TL;DR: In this paper, the authors present a system for predicting user input on a keyboard, using a display comprising at least three fields: the first field displays an input string that is based on input selections such as keyboard entries; the second field displays a candidate prediction generated based on other input selections, consisting at least in part of a proposed completion to the input selection, and partially based on the input string in the first field; the third field displays another candidate prediction generated based on the input string in the first field as well as the candidate prediction in the second field.
Abstract: Methods and systems for predicting user input on a keyboard. Methods include enabling user input on a display comprising at least three fields. The first field displays an input string that is based on input selections such as keyboard entries. The second field displays a candidate prediction generated based on other input selections, consisting at least in part of a proposed completion to the input selection, and partially based on the input string in the first field. The third field displays another candidate prediction generated based on the input string in the first field as well as the candidate prediction in the second field.

Patent
28 Aug 2012
TL;DR: In this paper, a translation method is adapted to a domain of interest by generating a set of candidate translations of the source text string, each candidate translation comprising a sequence of target words in a target language.
Abstract: A translation method is adapted to a domain of interest. The method includes receiving a source text string comprising a sequence of source words in a source language and generating a set of candidate translations of the source text string, each candidate translation comprising a sequence of target words in a target language. An optimal translation is identified from the set of candidate translations as a function of at least one domain-adapted feature computed based on bilingual probabilities and monolingual probabilities. Each bilingual probability is for a source text fragment and a target text fragment of the source text string and candidate translation respectively. The bilingual probabilities are estimated on an out-of-domain parallel corpus that includes source and target strings. The monolingual probabilities for text fragments of one of the source text string and candidate translation are estimated on an in-domain monolingual corpus.

Journal ArticleDOI
TL;DR: Two novel algorithms for the case where the text is fixed and many queries arrive over time are presented, both using an O(n) size data structure, each of which can be constructed in O( n) time.
Abstract: The Parikh vector \(p(s)\) of a string \(s\) over a finite ordered alphabet \(\Sigma = \{a_1, \ldots, a_\sigma\}\) is defined as the vector of multiplicities of the characters, \(p(s) = (p_1, \ldots, p_\sigma)\), where \(p_i = |\{j \mid s_j = a_i\}|\). Parikh vector \(q\) occurs in \(s\) if \(s\) has a substring \(t\) with \(p(t) = q\). The problem of searching for a query \(q\) in a text \(s\) of length \(n\) can be solved simply and worst-case optimally with a sliding window approach in \(O(n)\) time. We present two novel algorithms for the case where the text is fixed and many queries arrive over time. The first algorithm only decides whether a given Parikh vector appears in a binary text. It uses a linear size data structure and decides each query in \(O(1)\) time. The preprocessing can be done trivially in \(\Theta(n^2)\) time. The second algorithm finds all occurrences of a given Parikh vector in a text over an arbitrary alphabet of size \(\sigma \ge 2\) and has sub-linear expected time complexity. More precisely, we present two variants of the algorithm, both using an \(O(n)\) size data structure, each of which can be constructed in \(O(n)\) time. The first solution is very simple and easy to implement and leads to an expected query time of [...], where \(m = \sum_i q_i\) is the length of a string with Parikh vector \(q\). The second uses wavelet trees and improves the expected runtime to [...], i.e., by a factor of \(\log m\).
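
The sliding-window baseline mentioned in the abstract above (not the paper's two new index-based algorithms) can be sketched as follows. The function name and the representation of the query as a character-to-count dictionary are assumptions made for this example.

```python
def find_parikh_occurrences(s, q):
    """Return the start indices of all substrings of s whose Parikh vector is q.

    q maps characters to required multiplicities. A window of length
    m = sum(q.values()) slides over s; `diff` holds (window count - required
    count) per character and `nonzero` counts the characters where they differ.
    """
    need = {c: k for c, k in q.items() if k > 0}
    m = sum(need.values())
    if m == 0 or m > len(s):
        return []
    diff = {c: -k for c, k in need.items()}
    nonzero = len(diff)

    def bump(c, delta):
        # Apply a +1/-1 change for character c and keep `nonzero` in sync.
        nonlocal nonzero
        old = diff.get(c, 0)
        new = old + delta
        diff[c] = new
        if old == 0 and new != 0:
            nonzero += 1
        elif old != 0 and new == 0:
            nonzero -= 1

    hits = []
    for i, c in enumerate(s):
        bump(c, +1)                 # character entering the window
        if i >= m:
            bump(s[i - m], -1)      # character leaving the window
        if i >= m - 1 and nonzero == 0:
            hits.append(i - m + 1)
    return hits
```

Each position triggers a constant number of dictionary updates, so one query costs time linear in the text length, in line with the O(n) bound stated for the sliding-window approach.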

Book ChapterDOI
05 Mar 2012
TL;DR: In this paper, it is shown how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, one can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], the occ occurrences of P in S can be listed in O(m^2 + (m + occ) log log n) time.
Abstract: To store and search genomic databases efficiently, researchers have recently started building compressed self-indexes based on straight-line programs and LZ77. In this paper we show how, given a balanced straight-line program for a string S[1..n] whose LZ77 parse consists of z phrases, we can add O(z log log z) words and obtain a compressed self-index for S such that, given a pattern P[1..m], we can list the occ occurrences of P in S in O(m^2 + (m + occ) log log n) time. All previous self-indexes are either larger or slower in the worst case.

Journal ArticleDOI
TL;DR: In this article, a nondissipative string current diverter is proposed to overcome the problem of inhomogeneous irradiation in photovoltaic (PV) power generation systems.
Abstract: Frequently considered one of the promising solutions for grid connection of the photovoltaic (PV) power generation system, module-integrated converters have been the focus of numerous papers. Most of the proposed approaches thus far have relied on the use of series string of the dc-dc converter to create a high-voltage string connected to the dc-ac inverter. The boost converter is better in this application. However, under inhomogeneous irradiation, the power generated by each PV module and the output dc voltage of each boost become unbalanced so that the output currents of each boost are balanced and equal to the string current. In this case, the boost converter cannot always deliver all the power from a mixture of shaded panels and those delivering full power. In this paper, a nondissipative string current diverter is proposed to overcome this problem. One important feature of the proposed circuit herein is the ability to effectively decouple each converter from the rest of the string, making it insensitive to change in the string current. Hence, it is possible to obtain the maximum power from the PV module with the maximum power point tracking algorithm implemented on each dc-dc converter and to do so at the optimum efficiency. The simulation and experimental results verify that the proposed topology exhibits notable performances despite inhomogeneous irradiation. On the other hand, the string current diverter circuit is very easy to control and does not operate without inhomogeneous irradiation, so the topology efficiency is improved for any type of irradiation.

Patent
Te-Pei Tseng, Kun-Da Wu
02 May 2012
TL;DR: In this article, a method for calibrating an input of a webpage address used in a handheld electronic device is provided. The device comprises a touch display unit, a storage unit for storing a plurality of website address data, and a processing unit electrically connected to the touch display unit and the storage unit.
Abstract: A method for calibrating an input of a webpage address used in a handheld electronic device is provided. The handheld electronic device comprises a touch display unit, a storage unit for storing a plurality of website address data, and a processing unit electrically connected to the touch display unit and the storage unit. The method comprises the following steps. At least one character is received from the touch display unit, wherein each character has a plurality of neighboring characters on a keyboard. A plurality of string combinations are generated by the processing unit according to the neighboring characters. The storage unit is searched by the processing unit according to the string combinations to generate an address suggestion list. A handheld electronic device is disclosed herein as well.
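
The steps listed in the abstract above (receive characters, expand each into its keyboard neighbours, search stored addresses) can be sketched as follows. The adjacency map `NEIGHBOURS`, the function names, and the use of prefix matching to search the stored addresses are all assumptions for illustration; the abstract does not specify them.

```python
from itertools import product

# Hypothetical, partial QWERTY adjacency map; a real device would cover all keys.
NEIGHBOURS = {"h": "gjyunb", "p": "ol"}

def string_combinations(typed, neighbours=NEIGHBOURS):
    """All strings obtained by keeping each typed character or swapping it
    for one of its keyboard neighbours."""
    options = [ch + neighbours.get(ch, "") for ch in typed]
    return ["".join(chars) for chars in product(*options)]

def suggest_addresses(typed, stored_addresses, neighbours=NEIGHBOURS):
    """Return the stored addresses that start with any generated combination
    (prefix matching is an assumed search rule)."""
    if not typed:
        return []
    combos = set(string_combinations(typed, neighbours))
    return [addr for addr in stored_addresses
            if any(addr.startswith(combo) for combo in combos)]
```

The number of combinations grows multiplicatively with the input length, so a practical version would presumably bound the number of typed characters that are expanded.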

Proceedings ArticleDOI
01 Jan 2012
TL;DR: PermA, a general-purpose string aligner, and Balloon, a text processing toolkit for German and English providing components for part-of-speech tagging, morphological analyses, and grapheme-to-phoneme conversion including syllabification and word-stress assignment, are introduced.
Abstract: Two online research tools are presented in this paper: PermA, a general-purpose string aligner which can for example be used for grapheme-to-phoneme and phoneme-to-phoneme alignment, and Balloon, a text processing toolkit for German and English providing components for part-of-speech tagging, morphological analyses, and grapheme-to-phoneme conversion including syllabification and word-stress assignment. The general architectures of these tools are introduced with a focus on recent improvements concerning the alignment cost function derivation and word stress assignment.

Journal ArticleDOI
TL;DR: In this paper, the minimum-makespan supervisor synthesis problem is solved by a terminable algorithm, where the execution time of each string is computable by the theory of heaps-of-pieces.
Abstract: In many practical applications, we need to compute a nonblocking supervisor that not only complies with pre-specified safety requirements but also achieves a certain time optimal performance such as maximum throughput. In this paper, we first present a minimum-makespan supervisor synthesis problem. Then we show that the problem can be solved by a terminable algorithm, where the execution time of each string is computable by the theory of heaps-of-pieces. We also provide a timed supervisory control map that can implement the synthesized minimum-makespan sublanguage.

Journal Article
TL;DR: A discretised version of a string diagram, called a string graph, is introduced, and it is shown how string graphs modulo a rewrite system can be used to construct free symmetric traced and compact closed categories on a monoidal signature.
Abstract: This work is about diagrammatic languages, how they can be represented, and what they in turn can be used to represent. More specifically, it focuses on representations and applications of string diagrams. String diagrams are used to represent a collection of processes, depicted as "boxes" with multiple (typed) inputs and outputs, depicted as "wires". If we allow plugging input and output wires together, we can intuitively represent complex compositions of processes, formalised as morphisms in a monoidal category. [...] The first major contribution of this dissertation is the introduction of a discretised version of a string diagram called a string graph. String graphs form a partial adhesive category, so they can be manipulated using double-pushout graph rewriting. Furthermore, we show how string graphs modulo a rewrite system can be used to construct free symmetric traced and compact closed categories on a monoidal signature. The second contribution is in the application of graphical languages to quantum information theory. We use a mixture of diagrammatic and algebraic techniques to prove a new classification result for strongly complementary observables. [...] We also introduce a graphical language for multipartite entanglement and illustrate a simple graphical axiom that distinguishes the two maximally-entangled tripartite qubit states: GHZ and W. [...] The third contribution is a description of two software tools developed in part by the author to implement much of the theoretical content described here. The first tool is Quantomatic, a desktop application for building string graphs and graphical theories, as well as performing automated graph rewriting visually. The second is QuantoCoSy, which performs fully automated, model-driven theory creation using a procedure called conjecture synthesis.

Journal ArticleDOI
01 Aug 2012
TL;DR: This paper designs efficient trie-join algorithms and pruning techniques to achieve high performance and shows that these algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.
Abstract: A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string similarity joins with edit-distance constraints. Existing methods usually employ a filter-and-refine framework and suffer from the following limitations: (1) They are inefficient for the data sets with short strings (the average string length is not larger than 30); (2) They involve large indexes; (3) They are expensive to support dynamic update of data sets. To address these problems, we propose a novel method called trie-join, which can generate results efficiently with small indexes. We use a trie structure to index the strings and utilize the trie structure to efficiently find similar string pairs based on subtrie pruning. We devise efficient trie-join algorithms and pruning techniques to achieve high performance. Our method can be easily extended to support dynamic update of data sets efficiently. We conducted extensive experiments on four real data sets. Experimental results show that our algorithms outperform state-of-the-art methods by an order of magnitude on the data sets with short strings.
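To make the idea of indexing strings in a trie and pruning subtries concrete, here is a minimal Python sketch. It is not the paper's trie-join algorithm: it swaps in the simpler, well-known technique of walking a trie while maintaining one row of the edit-distance table per node, and abandoning a subtrie once every entry in the row exceeds the threshold. The function names and the threshold parameter `tau` are assumptions.

```python
def build_trie(strings):
    """Index strings in a nested-dict trie; the key None marks a string end."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[None] = s
    return root


def search(trie, query, tau):
    """Return indexed strings within edit distance tau of query."""
    results = []

    def walk(node, row):
        # row[j] = edit distance between the trie prefix and query[:j]
        if None in node and row[-1] <= tau:
            results.append(node[None])
        for ch, child in node.items():
            if ch is None:
                continue
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insertion
                                   row[j] + 1,           # deletion
                                   row[j - 1] + cost))   # match or substitution
            if min(new_row) <= tau:  # otherwise prune the whole subtrie
                walk(child, new_row)

    walk(trie, list(range(len(query) + 1)))
    return results


def similarity_join(left, right, tau):
    """Return sorted pairs (s, t), s in left and t in right, with distance <= tau."""
    trie = build_trie(right)
    return sorted((s, t) for s in left for t in search(trie, s, tau))


print(similarity_join(["kitten", "string"],
                      ["sitten", "sitting", "strong", "trie"], 1))
```

Strings that share a prefix share the work of computing the table rows for that prefix, which is one reason a trie index can suit collections of short strings.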

Proceedings ArticleDOI
17 Apr 2012
TL;DR: This paper presents an approach in which examples of inputs are sought from the Internet by reformulating program identifiers into web queries, and used to augment and seed a search-based test data generation technique.
Abstract: Generating realistic, branch-covering string inputs is a challenging problem, due to the diverse and complex types of real-world data that are naturally encodable as strings, for example resource locators, dates of different localised formats, international banking codes, and national identity numbers. This paper presents an approach in which examples of inputs are sought from the Internet by reformulating program identifiers into web queries. The resultant URLs are downloaded, split into tokens, and used to augment and seed a search-based test data generation technique. The use of the Internet as part of test input generation has two key advantages. Firstly, web pages are a rich source of valid inputs for various types of string data that may be used to improve test coverage. Secondly, the web pages tend to contain realistic, human-readable values, which are invaluable when test cases need manual confirmation due to the lack of an automated oracle. An empirical evaluation of the approach is presented, involving string input validation code from 10 open source projects. Well-formed, valid string inputs were retrieved from the web for 96% of the different string types analysed. Using the approach, coverage was improved for 75% of the Java classes studied by an average increase of 14%.
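The first and third steps of the pipeline described above (turning an identifier into a query, and splitting downloaded text into tokens) can be sketched in Python. The abstract does not specify how identifiers are reformulated, so the camelCase and underscore splitting below, along with both function names, are assumptions; the download step is omitted entirely.

```python
import re


def identifier_to_query(identifier):
    """Split a camelCase or snake_case identifier into a lowercase query."""
    spaced = identifier.replace("_", " ")
    # Insert a space wherever a lowercase letter or digit meets an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)
    return " ".join(spaced.lower().split())


def tokenise(page_text):
    """Split downloaded text into whitespace-separated candidate inputs."""
    return re.findall(r"\S+", page_text)


print(identifier_to_query("isValidPostCode"))
print(tokenise("Sample codes: SW1A 1AA\nEC1A 1BB"))
```

The resulting tokens would then serve as seed values for the search-based test data generator rather than as finished test inputs.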

Patent
30 Aug 2012
TL;DR: In this paper, a network on chip processor including multiple cores and a Kautz NoC is presented, where each of the cores is assigned an addressing string of L base-D words, and the addressing string does not have two neighboring identical words.
Abstract: An exemplary embodiment of the present disclosure illustrates a network on chip processor including multiple cores and a Kautz NoC. Each of the cores is assigned an addressing string of L base-D words, and the addressing string does not have two neighboring identical words, wherein L, representing the addressing string length, is an integer larger than 1, and D, representing the word selection, is an integer larger than 2. Each of the cores is unidirectionally linked to (D−1) other cores through the Kautz NoC, and for any two connected cores, the last (L−1) words of the addressing string of one core are the same as the first (L−1) words of the addressing string of the other core.
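The addressing rule in this abstract can be checked with a short Python sketch. It only enumerates addresses and outgoing links under the stated constraints; the function names are assumptions, and nothing here models the hardware itself.

```python
from itertools import product


def kautz_addresses(L, D):
    """All length-L strings over D words with no two neighboring words equal."""
    return [w for w in product(range(D), repeat=L)
            if all(a != b for a, b in zip(w, w[1:]))]


def successors(addr, D):
    """Cores reached from addr: drop the first word, then append any word
    that differs from the current last word."""
    return [addr[1:] + (x,) for x in range(D) if x != addr[-1]]


addresses = kautz_addresses(L=2, D=3)
print(len(addresses))          # D * (D - 1) ** (L - 1) addresses
print(successors((0, 1), 3))   # D - 1 outgoing links
```

Each successor shares its first (L−1) words with the last (L−1) words of the source address, and every core has exactly (D−1) outgoing links, matching the description above.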

Journal ArticleDOI
TL;DR: It is proved that k-SPC, the problem of the set of parameterized k-covers, is NP-complete; this problem combines the k-cover measure with parameterized matching, which is a distance measure for strings.