Showing papers on "String (computer science)" published in 2008


Journal ArticleDOI
TL;DR: The tm package, which provides a framework for text mining applications within R, is presented, together with techniques for count-based analysis methods, text clustering, text classification and string kernels.
Abstract: During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

1,057 citations
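
The tm framework itself is R code; purely as a language-neutral illustration of the count-based analysis the abstract mentions, the sketch below builds a term-document count matrix over a made-up three-document corpus (names here are invented, not tm's API):

    from collections import Counter

    # Made-up corpus standing in for a real document collection.
    corpus = ["the cat sat", "the dog sat", "dogs and cats"]

    # Term-document count matrix: one Counter of term frequencies per document.
    counts = [Counter(doc.split()) for doc in corpus]
    vocab = sorted(set(term for c in counts for term in c))
    matrix = [[c[term] for term in vocab] for c in counts]

    for term, col in zip(vocab, zip(*matrix)):
        print(term, list(col))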


Book ChapterDOI
17 Aug 2008
TL;DR: A simple and efficient compiler is presented for transforming secure multi-party computation protocols that enjoy security only with an honest majority into MPC protocols that guarantee security with no honest majority, in the oblivious-transfer (OT) hybrid model.
Abstract: We present a simple and efficient compiler for transforming secure multi-party computation (MPC) protocols that enjoy security only with an honest majority into MPC protocols that guarantee security with no honest majority, in the oblivious-transfer (OT) hybrid model. Our technique works by combining a secure protocol in the honest majority setting with a protocol achieving only security against semi-honest parties in the setting of no honest majority. Applying our compiler to variants of protocols from the literature, we get several applications for secure two-party computation and for MPC with no honest majority. These include: Constant-rate two-party computation in the OT-hybrid model. We obtain a statistically UC-secure two-party protocol in the OT-hybrid model that can evaluate a general circuit C of size s and depth d with a total communication complexity of O(s) + poly(k, d, log s) and O(d) rounds. The above result generalizes to a constant number of parties. Extending OTs in the malicious model. We obtain a computationally efficient protocol for generating many string OTs from few string OTs with only a constant amortized communication overhead compared to the total length of the string OTs. Black-box constructions for constant-round MPC with no honest majority. We obtain general computationally UC-secure MPC protocols in the OT-hybrid model that use only a constant number of rounds, and only make black-box access to a pseudorandom generator. This gives the first constant-round protocols for three or more parties that only make a black-box use of cryptographic primitives (and avoid expensive zero-knowledge proofs).

635 citations


Proceedings Article
01 May 2008
TL;DR: ParsCit is described, a freely available, open-source implementation of a reference string parsing package that wraps a trained conditional random field model with added functionality to identify reference strings from a plain text file and to retrieve the citation contexts.
Abstract: We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We compare ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

339 citations
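
ParsCit's core is a trained CRF; purely to illustrate the token-labeling task it solves (not its actual model), here is a naive rule-based sketch with invented labels:

    import re

    # Ad hoc rules standing in for ParsCit's learned CRF features.
    def label_tokens(reference):
        labels = []
        for tok in reference.split():
            if re.fullmatch(r"\(?(19|20)\d\d\)?[.,]?", tok):
                labels.append((tok, "DATE"))
            elif re.fullmatch(r"[A-Z][a-z]+,?", tok):
                labels.append((tok, "NAME_OR_TITLE"))
            else:
                labels.append((tok, "OTHER"))
        return labels

    print(label_tokens("Smith, J. (2008). On strings. J. ACM."))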


Journal ArticleDOI
TL;DR: In this article, an approach to find transition pathways in complex systems is presented, which consists of refining a putative transition path in the multidimensional space supported by a set of collective variables using the average dynamic drift of those variables.
Abstract: An approach to find transition pathways in complex systems is presented. The method, which is related to the string method in collective variables of Maragliano et al. (J. Chem. Phys. 2006, 125, 024106), is conceptually simple and straightforward to implement. It consists of refining a putative transition path in the multidimensional space supported by a set of collective variables using the average dynamic drift of those variables. This drift is estimated on-the-fly via swarms of short unbiased trajectories started at different points along the path. Successive iterations of this algorithm, which can be naturally distributed over many computer nodes with negligible interprocessor communication, refine an initial trial path toward the most probable transition path (MPTP) between two stable basins. The method is first tested by determining the pathway for the C7eq to C7ax transition in an all-atom model of the alanine dipeptide in vacuum, which has been studied previously with the string method in collective variables.

306 citations
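
A minimal sketch of the swarms-of-trajectories idea on an invented 2D double-well potential (all parameters are illustrative; a real application would use the collective variables of a molecular system):

    import numpy as np

    rng = np.random.default_rng(0)

    def grad(x):  # gradient of V(x, y) = (x^2 - 1)^2 + y^2, basins at (+/-1, 0)
        return np.array([4 * x[0] * (x[0] ** 2 - 1), 2 * x[1]])

    def average_drift(point, n_traj=50, n_steps=10, dt=1e-3, beta=5.0):
        # Swarm of short unbiased overdamped trajectories started at `point`.
        ends = []
        for _ in range(n_traj):
            x = point.copy()
            for _ in range(n_steps):
                x += -grad(x) * dt + np.sqrt(2 * dt / beta) * rng.normal(size=2)
            ends.append(x)
        return np.mean(ends, axis=0) - point  # mean displacement estimates drift

    path = np.linspace([-1.0, 0.0], [1.0, 0.0], 12)  # initial straight guess
    for _ in range(20):
        for i in range(1, len(path) - 1):            # move interior images only
            path[i] += average_drift(path[i])
        # Reparametrize: redistribute images evenly along the refined path.
        dists = np.cumsum(np.r_[0, np.linalg.norm(np.diff(path, axis=0), axis=1)])
        targets = np.linspace(0, dists[-1], len(path))
        path = np.array([np.interp(targets, dists, path[:, k]) for k in range(2)]).T
    print(path.round(2))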


Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper develops several algorithms that can greatly improve the performance of existing algorithms and studies how to integrate existing filtering techniques with these algorithms, and shows that they should be used together judiciously.
Abstract: We study the following problem: how to efficiently find in a collection of strings those similar to a given query string? Various similarity functions can be used, such as edit distance, Jaccard similarity, and cosine similarity. This problem is of great interest to a variety of applications that need high real-time performance, such as data cleaning, query relaxation, and spellchecking. Several algorithms have been proposed based on the idea of merging inverted lists of grams generated from the strings. In this paper we make two contributions. First, we develop several algorithms that can greatly improve the performance of existing algorithms. Second, we study how to integrate existing filtering techniques with these algorithms, and show that they should be used together judiciously, since the way to do the integration can greatly affect the performance. We have conducted experiments on several real data sets to evaluate the proposed techniques.

296 citations
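
A minimal sketch of the list-merging idea this line of work builds on: index q-grams in inverted lists, merge the query's lists under a count filter, and verify survivors with edit distance (data and parameters are invented):

    from collections import defaultdict

    def grams(s, q=2):
        return [s[i:i + q] for i in range(len(s) - q + 1)]

    strings = ["string", "strong", "spring", "sting", "trying"]
    index = defaultdict(list)                 # gram -> list of string ids
    for sid, s in enumerate(strings):
        for g in set(grams(s)):
            index[g].append(sid)

    def edit_distance(a, b):
        d = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, cb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
        return d[-1]

    def search(query, k=1, q=2):
        # Count filter: a string within edit distance k shares at least
        # (|query| - q + 1) - k*q grams with the query.
        threshold = max(1, len(query) - q + 1 - k * q)
        counts = defaultdict(int)
        for g in grams(query, q):
            for sid in index[g]:
                counts[sid] += 1
        cands = [sid for sid, c in counts.items() if c >= threshold]
        return [strings[s] for s in cands if edit_distance(query, strings[s]) <= k]

    print(search("string", k=1))  # -> ['string', 'strong', 'sting']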


Patent
Gary Wassermann, Dachuan Yu, Ajay Chander, Dinakar Dhurjati, Hiroshi Inamura
03 Nov 2008
TL;DR: A method and apparatus for automated test input generation for web applications is described, which comprises performing a source-to-source transformation of the program, interpreting the program on a set of test input values, symbolically executing the program, and recording a symbolic constraint for each conditional expression encountered during execution; string operations are analyzed to identify possible execution paths, the values of variables in each conditional expression are represented as a numeric expression and a string constraint, and constraints on string values are generated by modeling string operations with finite state transducers.
Abstract: A method and apparatus is disclosed herein for automated test input generation for web applications. In one embodiment, the method comprises performing a source-to-source transformation of the program; performing interpretation on the program based on a set of test input values; symbolically executing the program; recording a symbolic constraint for each of one or more conditional expressions encountered during execution of the program, including analyzing a string operation in the program to identify one or more possible execution paths, and generating symbolic inputs representing values of variables in each of the conditional expressions as a numeric expression and a string constraint including generating constraints on string values by modeling string operations using finite state transducers (FSTs) and supplying values from the program's execution in place of intractable sub-expressions; and generating new inputs to drive the program during a subsequent iteration based on results of solving the recorded string constraints.

231 citations


Proceedings ArticleDOI
20 Jul 2008
TL;DR: An automated input test generation algorithm that uses runtime values to analyze dynamic code, models the semantics of string operations, and handles operations whose argument and return values may not share a common type is proposed.
Abstract: Web applications routinely handle sensitive data, and many people rely on them to support various daily activities, so errors can have severe and broad-reaching consequences. Unlike most desktop applications, many web applications are written in scripting languages, such as PHP. The dynamic features commonly supported by these languages significantly inhibit static analysis, and existing static analyses of these languages can fail to produce meaningful results on real-world web applications. Automated test input generation using the concolic testing framework has proven useful for finding bugs and improving test coverage on C and Java programs, which generally emphasize numeric values and pointer-based data structures. However, scripting languages, such as PHP, promote a style of programming for developing web applications that emphasizes string values, objects, and arrays. In this paper, we propose an automated input test generation algorithm that uses runtime values to analyze dynamic code, models the semantics of string operations, and handles operations whose argument and return values may not share a common type. As in the standard concolic testing framework, our algorithm gathers constraints during symbolic execution. Our algorithm resolves constraints over multiple types by considering each variable instance individually, so that it only needs to invert each operation. By recording constraints selectively, our implementation successfully finds bugs in real-world web applications which state-of-the-art static analysis tools fail to analyze.

203 citations
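
To give the flavor of inverting string operations when solving gathered path constraints (a toy, not the paper's algorithm; names are invented):

    # Invert the constraint s + suffix == target to recover a value for s,
    # the way a concolic engine inverts one operation at a time.
    def solve_concat_suffix(target, suffix):
        if target.endswith(suffix):
            return target[: len(target) - len(suffix)]
        return None  # constraint unsatisfiable

    # Suppose the program under test branches on: user + "@example.com" == addr.
    # To drive execution down the true branch, invert the concatenation:
    print(solve_concat_suffix("alice@example.com", "@example.com"))  # -> "alice"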


Patent
04 Mar 2008
TL;DR: In this paper, a virtual keyboard is presented in a first region of a touch sensitive display of a device and an input representing a phonetic string is received on the virtual keyboard.
Abstract: Methods, systems, and apparatus, including computer program products, for inputting text. A virtual keyboard is presented in a first region of a touch sensitive display of a device. An input representing a phonetic string is received on the virtual keyboard. The entered phonetic string is presented in a second region of the touch sensitive display. One or more candidates are identified based on the phonetic string. At least a subset of the candidates is presented. An input selecting one of the candidates is received. The entered phonetic string is replaced with the selected candidate.

195 citations
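
A toy sketch of the candidate-lookup step the patent describes; the phonetic table is an invented pinyin-like stand-in:

    # Map an entered phonetic string to candidate words; the selected
    # candidate then replaces the phonetic string in the display region.
    candidates_by_phonetic = {
        "ni": ["你", "呢", "尼"],
        "nihao": ["你好"],
    }

    def candidates(phonetic):
        return candidates_by_phonetic.get(phonetic, [])

    entered = "nihao"
    shown = candidates(entered)   # subset presented to the user
    selected = shown[0]           # input selecting one of the candidates
    print(selected)               # replaces the entered phonetic string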


Proceedings ArticleDOI
07 Jan 2008
TL;DR: The essential property of resourcefulness (the correct use of keys to associate chunks in the input and output) is formalized by defining a refined semantic space of quasi-oblivious lenses, in which several previously studied properties of lenses turn out to have compact characterizations.
Abstract: A lens is a bidirectional program. When read from left to right, it denotes an ordinary function that maps inputs to outputs. When read from right to left, it denotes an ''update translator'' that takes an input together with an updated output and produces a new input that reflects the update. Many variants of this idea have been explored in the literature, but none deal fully with ordered data. If, for example, an update changes the order of a list in the output, the items in the output list and the chunks of the input that generated them can be misaligned, leading to lost or corrupted data. We attack this problem in the context of bidirectional transformations over strings, the primordial ordered data type. We first propose a collection of bidirectional string lens combinators, based on familiar operations on regular transducers (union, concatenation, Kleene-star) and with a type system based on regular expressions. We then design a new semantic space of dictionary lenses, enriching the lenses of Foster et al. (2007) with support for two additional combinators for marking ''reorderable chunks'' and their keys. To demonstrate the effectiveness of these primitives, we describe the design and implementation of Boomerang, a full-blown bidirectional programming language with dictionary lenses at its core. We have used Boomerang to build transformers for complex real-world data formats, including the SwissProt genomic database. We formalize the essential property of resourcefulness, the correct use of keys to associate chunks in the input and output, by defining a refined semantic space of quasi-oblivious lenses. Several previously studied properties of lenses turn out to have compact characterizations in this space.

195 citations
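
A lens in miniature, purely illustrative: a forward get that projects a view and a backward put that merges an edited view back into the original source (real dictionary lenses add combinators, regular-expression types, and key-based chunk alignment):

    # Toy lens over comma-separated lines: view the first field, and push
    # an edited field back while keeping the rest of the source intact.
    class FieldLens:
        def __init__(self, sep=","):
            self.sep = sep

        def get(self, source):                # left to right: extract the view
            return source.split(self.sep)[0]

        def put(self, new_view, old_source):  # right to left: translate update
            rest = old_source.split(self.sep)[1:]
            return self.sep.join([new_view] + rest)

    lens = FieldLens()
    src = "Bob,42,Paris"
    print(lens.get(src))            # -> "Bob"
    print(lens.put("Robert", src))  # -> "Robert,42,Paris"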


Journal ArticleDOI
TL;DR: This work motivates the use of tree transducers for natural language and addresses the training problem for probabilistic tree-to-tree and tree-to-string transducers.
Abstract: Many probabilistic models for natural language are now written in terms of hierarchical tree structure. Tree-based modeling still lacks many of the standard tools taken for granted in (finite-state) string-based modeling. The theory of tree transducer automata provides a possible framework to draw on, as it has been worked out in an extensive literature. We motivate the use of tree transducers for natural language and address the training problem for probabilistic tree-to-tree and tree-to-string transducers.

176 citations


Journal ArticleDOI
TL;DR: A single multi-class kernel machine that informatively combines the available feature groups and is able to provide the state-of-the-art in performance accuracy on the fold recognition problem is demonstrated.
Abstract: Motivation: The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest as they represent challenging multi-feature multi-class problems for which modern pattern recognition methods achieve only modest levels of performance. As with many pattern recognition problems, there are multiple feature spaces or groups of attributes available, such as global characteristics like the amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P), polarizability (Z), as well as attributes derived from local sequence alignment such as the Smith–Waterman scores. This raises the need for a classification method that is able to assess the contribution of these potentially heterogeneous object descriptors while utilizing such information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this article, is able to provide the state-of-the-art in performance accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework and a variational Bayes approximation is derived which allows for efficient CPU processing times. Results: The best performance which we report on the SCOP PDB-40D benchmark data-set is a 70% accuracy by combining all the available feature groups from global protein characteristics but also including sequence-alignment features. We offer an 8% improvement on the best reported performance that combines multi-class k-nn classifiers while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark data-set that simulates remote homology detection and examine the combination of various state-of-the-art string kernels that have recently been proposed. Contact: theo@dcs.gla.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.
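
A toy of the combination step only: several precomputed kernel (Gram) matrices, one per feature group, merged with weights before classification. The paper learns such weights within a Bayesian multi-class kernel machine; here the matrices are random stand-ins and the weights are fixed by hand:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 6
    # Stand-ins for kernels from different feature groups (composition,
    # secondary structure, ...); real ones would come from protein features.
    K_list = [a @ a.T for a in (rng.normal(size=(n, 4)) for _ in range(3))]
    weights = np.array([0.5, 0.3, 0.2])  # assumed here, normally learned

    # A nonnegative weighted sum of kernels is again a valid kernel.
    K_combined = sum(w * K for w, K in zip(weights, K_list))
    print(K_combined.shape)  # (6, 6)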

Patent
20 Oct 2008
TL;DR: In this article, the authors propose a method for signature scanning in string fields, which includes processing one or more signatures into one or more formats that include fingerprints and follow-on search data structures for each fixed-size signature or signature substring.
Abstract: Systems and methods for scanning signatures in a string field. In one implementation, the invention provides a method for signature scanning. The method includes processing one or more signatures into one or more formats that include one or more fingerprints and one or more follow-on search data structures for each fixed-size signature or signature substring such that the number of fingerprints for each fixed-size signature or signature substring is equal to a step size for a signature scanning operation and the particular fixed-size signature or signature substring is identifiable at any location within any string fields to be scanned, receiving a particular string field, identifying any signatures included in the particular string field including scanning for the fingerprints for each scan step size and searching for the follow-on search data structures at the locations where one or more fingerprints are found, and outputting any identified signatures.
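
A toy version of the core trick: store one short fingerprint per scan alignment of each signature, so a scanner that probes every step-size bytes still catches any placement, then confirm with a follow-on full comparison (window and step sizes here are arbitrary illustrative choices):

    STEP, WIN = 4, 8

    def build(signatures):
        table = {}
        for sig in signatures:
            # One fingerprint per scan alignment, so any placement is caught.
            for off in range(STEP):
                if off + WIN <= len(sig):
                    table.setdefault(sig[off:off + WIN], []).append((sig, off))
        return table

    def scan(data, table):
        hits = set()
        for pos in range(0, len(data) - WIN + 1, STEP):
            for sig, off in table.get(data[pos:pos + WIN], []):
                start = pos - off
                # Follow-on search: confirm the full signature at this spot.
                if start >= 0 and data[start:start + len(sig)] == sig:
                    hits.add(sig)
        return hits

    table = build([b"EVILPAYLOAD!"])
    print(scan(b"xxxxEVILPAYLOAD!yyyy", table))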

PatentDOI
TL;DR: In this article, the authors proposed a voice query extension method that detects voice activity of a user from an input signal and extracting a feature vector from the voice activity, converting the feature vector into at least one phoneme sequence and generating the at least phoneme sequences; matching the matching words with words registered in a dictionary, extracting a string of matched words with a linguistic meaning, and selecting the string of the matched words as a query; determining whether the query is in a predetermined first language, and when the query was not in the first language as a result of the determining,
Abstract: A voice query extension method and system. The voice query extension method includes: detecting voice activity of a user from an input signal and extracting a feature vector from the voice activity; converting the feature vector into at least one phoneme sequence and generating the at least one phoneme sequence; matching the at least one phoneme sequence with words registered in a dictionary, extracting a string of the matched words with a linguistic meaning, and selecting the string of the matched words as a query; determining whether the query is in a predetermined first language, and when the query is not in the first language as a result of the determining, converting the query using a phoneme to grapheme rule, and generating a query in the first language; and searching using the query in the first language.

Proceedings ArticleDOI
07 Apr 2008
TL;DR: A programmatic framework for record matching that takes user-defined string transformations as input is proposed, the first proposal for such a framework to the authors' knowledge.
Abstract: Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We propose a programmatic framework of record matching that takes such user-defined string transformations as input. To the best of our knowledge, this is the first proposal for such a framework. This transformational framework, while expressive, poses significant computational challenges which we address. We empirically evaluate our techniques over real data.
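
A minimal sketch of matching under user-defined transformations; the rule table is a made-up stand-in for the framework's transformation input:

    # User-defined rewrite rules: nicknames and abbreviations.
    transforms = {"bob": "robert", "bill": "william", "st": "street"}

    def normalize(record):
        return tuple(transforms.get(tok, tok) for tok in record.lower().split())

    def match(r1, r2):
        return normalize(r1) == normalize(r2)

    print(match("Bob Smith", "Robert Smith"))   # True under the nickname rule
    print(match("Bob Smith", "William Smith"))  # False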

Journal ArticleDOI
TL;DR: Two optimal linear-time algorithms for computing the Longest Previous Factor (LPF) array corresponding to a string w are given and several properties and applications of LPF are investigated.
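
For intuition, a naive quadratic computation of the LPF array (the paper's contribution is computing it in linear time from the suffix array):

    # LPF[i] = length of the longest factor starting at i that also occurs
    # starting at some earlier position j < i.
    def lpf_naive(w):
        n = len(w)
        lpf = [0] * n
        for i in range(n):
            for j in range(i):  # candidate earlier occurrence
                k = 0
                while i + k < n and w[j + k] == w[i + k]:
                    k += 1
                lpf[i] = max(lpf[i], k)
        return lpf

    print(lpf_naive("abaabab"))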

Journal Article
TL;DR: This article proposes a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions, and provides linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures.
Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms, enabling peak performances of up to 10^6 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.
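
The embed-then-compare idea in miniature: map sequences to k-gram counts, then compute both a kernel and a distance on the embeddings. A plain dict stands in for the paper's sorted arrays, tries, and suffix trees:

    from collections import Counter
    from math import sqrt

    def embed(seq, k=3):
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    def kernel(x, y):    # spectrum kernel: inner product of k-gram counts
        return sum(cx * y[g] for g, cx in x.items())

    def distance(x, y):  # Euclidean distance in the embedding space
        return sqrt(kernel(x, x) - 2 * kernel(x, y) + kernel(y, y))

    a, b = embed("GATTACA"), embed("GATTTACA")
    print(kernel(a, b), distance(a, b))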

Proceedings Article
20 Jan 2008
TL;DR: Algorithms are obtained that require a number of traces exponential in Õ(√n) for any Δ < 1, even for worst-case strings, along with lower bound results for simpler classes of algorithms based on summary statistics from the traces.
Abstract: We provide several new results for the trace reconstruction problem. In this setting, a binary string yields a collection of traces, where each trace is independently obtained by independently deleting each bit with a fixed probability Δ. Each trace therefore consists of a random subsequence of the original sequence. Given the traces, we wish to reconstruct the original string with high probability. The questions are how many traces are necessary for reconstruction, and how efficiently can the reconstruction be performed. Our primary result is that for some universal constant γ and uniformly chosen strings of length n, for any Δ < γ the original string can be reconstructed from poly(n) traces in poly(n) time with high probability. We also obtain algorithms that require a number of traces exponential in Õ(√n) for any Δ < 1.
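
A toy model of the trace channel itself (the reconstruction algorithms are the paper's contribution and are omitted): each trace deletes every bit of the original string independently with probability Δ:

    import random

    random.seed(0)

    def trace(s, delta=0.1):
        # Keep each bit independently with probability 1 - delta.
        return "".join(c for c in s if random.random() > delta)

    original = "1011001110001011"
    for _ in range(5):
        print(trace(original))  # random subsequences of the original string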

Journal ArticleDOI
TL;DR: String matching has sparked renewed research interest due to its usefulness for deep packet inspection in applications such as intrusion detection, virus scanning, and Internet content filtering.
Abstract: String matching has sparked renewed research interest due to its usefulness for deep packet inspection in applications such as intrusion detection, virus scanning, and Internet content filtering. Matching expressive pattern specifications with a scalable and efficient design, accelerating the entire packet flow, and string matching with high-level semantics are promising topics for further study.

Journal ArticleDOI
TL;DR: The problems surveyed here include the classification of classes of structures with automatic presentations, the complexity of the isomorphism problem, and the relationship between definability and recognisability.
Abstract: A structure has a (finite-string) automatic presentation if the elements of its domain can be named by finite strings in such a way that the coded domain and the coded atomic operations are recognised by synchronous multitape automata. Consequently, every structure with an automatic presentation has a decidable first-order theory. The problems surveyed here include the classification of classes of structures with automatic presentations, the complexity of the isomorphism problem, and the relationship between definability and recognisability.

Proceedings ArticleDOI
09 Jun 2008
TL;DR: This study proposes a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance and proposes an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries.
Abstract: Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their frequencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. We analyze how a gram dictionary affects the index structure of the string collection and ultimately the performance of queries. We also propose an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries. Our experiments on real data sets show the improvement on query performance achieved by these techniques. To the best of our knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.
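
For fixed-length grams the analogous lower bound has a simple closed form, which the paper's dynamic program tightens and generalizes to variable-length gram dictionaries:

    # A string within edit distance k of s shares at least
    # (|s| - q + 1) - k*q of its q-grams, since one edit destroys
    # at most q grams.
    def fixed_gram_lower_bound(s, q, k):
        return max(0, (len(s) - q + 1) - k * q)

    print(fixed_gram_lower_bound("approximate", q=3, k=2))  # 9 - 6 = 3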

Proceedings ArticleDOI
18 Jun 2008
TL;DR: This paper investigates the use of a diagonal line derived from the Levenshtein distance, together with a simplified Smith-Waterman algorithm, a classical tool for identifying and quantifying local similarities in biological sequences, with a view to their application in plagiarism detection.
Abstract: Plagiarism in text is an issue of increasing concern to the academic community. Most common text plagiarism now occurs through a variety of minor alterations, including the insertion, deletion, or substitution of words. Detecting such simple changes, however, requires excessive string comparisons. In this paper, we present a hybrid plagiarism detection method. We investigate the use of a diagonal line, which is derived from the Levenshtein distance, and a simplified Smith-Waterman algorithm, a classical tool in the identification and quantification of local similarities in biological sequences, with a view to their application in plagiarism detection. Our approach avoids globally involved string comparisons and considers psychological factors, and experimental results show that it can yield significant speed-ups. Based on these results, we indicate the practicality of such improvements using the Levenshtein distance and the Smith-Waterman algorithm and illustrate the efficiency gains. In the future, it would be interesting to explore appropriate heuristics in the area of text comparison.
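
For reference, the standard Levenshtein distance table that the paper's diagonal-line heuristic restricts (the simplified Smith-Waterman component is omitted):

    def levenshtein(a, b):
        m, n = len(a), len(b)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1,      # deletion
                              d[i][j - 1] + 1,      # insertion
                              d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution
        return d[m][n]

    print(levenshtein("plagiarism", "plagiarised"))  # -> 2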

BookDOI
14 Jul 2008
TL;DR: The author examines different facets of string handling and manipulation, discusses the interfacing of R with other languages, describes how to write software packages, and concludes with a discussion of the debugging and profiling of R code.
Abstract: From the co-developer of R and lead founder of the Bioconductor Project. Thanks to its data handling and modeling capabilities and its flexibility, R is becoming the most widely used software in bioinformatics. R Programming for Bioinformatics builds the programming skills needed to use R for solving bioinformatics and computational biology problems. Drawing on the author's experience as an R expert, the book begins with coverage of the general properties of the R language, several unique programming aspects of R, and object-oriented programming in R. It presents methods for data input and output as well as database interactions. The author also examines different facets of string handling and manipulation, discusses the interfacing of R with other languages, and describes how to write software packages. He concludes with a discussion of the debugging and profiling of R code.

Book ChapterDOI
10 Aug 2008
TL;DR: This work presents an automata-based approach for the verification of string operations in PHP programs based on symbolic string analysis, and proposes a novel algorithm for language-based replacement that works quite well in checking the correctness of sanitization operations in real-world PHP applications.
Abstract: We present an automata-based approach for the verification of string operations in PHP programs based on symbolic string analysis. String analysis is a static analysis technique that determines the values that a string expression can take during program execution at a given program point. This information can be used to verify that string values are sanitized properly and to detect programming errors and security vulnerabilities. In our string analysis approach, we encode the set of string values that string variables can take as automata. We implement all string functions using a symbolic automata representation (MBDD representation from the MONA automata package) and leverage efficient manipulations on MBDDs, e.g., determinization and minimization. Particularly, we propose a novel algorithm for language-based replacement. Our replacement function takes three DFAs as arguments and outputs a DFA. Finally, we apply a widening operator defined on automata to approximate fixpoint computations. If this conservative approximation does not include any bad patterns (specified as regular expressions), we conclude that the program does not contain any errors or vulnerabilities. Our experimental results demonstrate that our approach works quite well in checking the correctness of sanitization operations in real-world PHP applications.
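
A loose, regex-based caricature of the verification idea: the analysis computes the language of possible string values and checks it against bad patterns. The real approach decides emptiness of the intersection on automata (MBDDs); this sketch merely tests sample strings, and both patterns are invented:

    import re

    # Suppose analysis determined the variable's post-sanitization language:
    possible_values = re.compile(r"[a-zA-Z0-9 ]*")
    bad_pattern = re.compile(r"<script")  # vulnerability signature

    def may_be_vulnerable(sample_values):
        # Stands in for emptiness checking of (possible values ∩ bad patterns).
        return any(possible_values.fullmatch(v) and bad_pattern.search(v)
                   for v in sample_values)

    print(may_be_vulnerable(["hello world", "<script>alert(1)"]))  # False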

Patent
16 Oct 2008
TL;DR: In this article, a method and system for accessing textual widgets is described, which includes: entering a string expression into a document, invoking a spell-checker to check a spelling of the string expression, marking it as misspelled, identifying a textual widget based on the misspelling of the text expression, evaluating the misspelled string expression using the identified textual widget, displaying the at least one result of the evaluation, selecting a result of evaluation, and replacing the string expressions in the document with the selected result of an evaluation.
Abstract: The disclosure is directed to a method and system for accessing textual widgets. A method in accordance with an embodiment includes: entering a string expression into a document; invoking a spell-checker to check a spelling of the string expression; marking the string expression as misspelled; identifying a textual widget based on the misspelling of the string expression; evaluating the misspelled string expression using the identified textual widget, the identified textual widget returning at least one result of the evaluation; displaying the at least one result of the evaluation; selecting a result of the evaluation; and replacing the string expression in the document with the selected result of the evaluation.

Journal ArticleDOI
TL;DR: SYBYL line notation is a powerful way to represent molecular structures, reactions, libraries of structures, molecular fragments, formulations, molecular queries, and reaction queries comparable to SMARTS.
Abstract: SYBYL line notation (SLN) is a powerful way to represent molecular structures, reactions, libraries of structures, molecular fragments, formulations, molecular queries, and reaction queries. Nearly any chemical structure imaginable, including macromolecules, pharmaceuticals, catalysts, and even combinatorial libraries can be represented as an SLN string. The language provides a rich syntax for database queries comparable to SMARTS. It provides full Markush, R-Group, reaction, and macro atom capabilities in a single unified notation. It includes the ability to specify 3D conformations and 2D depictions. All the information necessary to recreate the structure in a modeling or drawing package is present in a single, concise string of ASCII characters. This makes SLN ideal for structure communication over global computer networks between applications sitting at remote sites. Unlike SMILES and its derivatives, SLN accomplishes this within a single unified syntax. Structures, queries, compounds, reactions, and ...

Proceedings ArticleDOI
20 Jul 2008
TL;DR: Splat, a tool for automatically generating inputs that lead to memory safety violations in C programs, is presented, and it is experimentally demonstrated that the symbolic length abstraction is both scalable and sufficient to uncover many real buffer overflows in C Programs.
Abstract: We present Splat, a tool for automatically generating inputs that lead to memory safety violations in C programs. Splat performs directed random testing of the code, guided by symbolic execution. However, instead of representing the entire contents of an input buffer symbolically, Splat tracks only a prefix of the buffer symbolically, and a symbolic length that may exceed the size of the symbolic prefix. The part of the buffer beyond the symbolic prefix is filled with concrete random inputs. The use of symbolic buffer lengths makes it possible to compactly summarize the behavior of standard buffer manipulation functions, such as string library functions, leading to a more scalable search for possible memory errors. While reasoning only about prefixes of buffer contents makes the search theoretically incomplete, we experimentally demonstrate that the symbolic length abstraction is both scalable and sufficient to uncover many real buffer overflows in C programs. In experiments on a set of benchmarks developed independently to evaluate buffer overflow checkers, Splat was able to detect buffer overflows quickly, sometimes several orders of magnitude faster than when symbolically representing entire buffers. Splat was also able to find two previously unknown buffer overflows in a heavily-tested storage system.

Patent
31 Mar 2008
TL;DR: In this paper, the authors propose associating a source string with a target content unit stored on a content addressable storage (CAS) system, which may be accomplished, in some embodiments, by storing on the CAS system an associative content unit that includes the source string in its binding part and includes the target content units in its non-binding part.
Abstract: Embodiments of the invention relate to associating a source string with a target content unit stored on a content addressable storage (CAS) system. This may be accomplished, in some embodiments, by storing on the CAS system an associative content unit that includes the source string in its binding part and includes the target content unit in its non-binding part.

PatentDOI
TL;DR: In this paper, the authors identify a generic model of speech composed of phonemes, identify a family of interchangeable phonemic alternatives for a phoneme in the generic model, and generate a pronunciation model which substitutes each family for each respective phoneme.
Abstract: Systems, computer-implemented methods, and tangible computer-readable media for generating a pronunciation model. The method includes identifying a generic model of speech composed of phonemes, identifying a family of interchangeable phonemic alternatives for a phoneme in the generic model of speech, labeling the family of interchangeable phonemic alternatives as referring to the same phoneme, and generating a pronunciation model which substitutes each family for each respective phoneme. In one aspect, the generic model of speech is a vocal tract length normalized acoustic model. Interchangeable phonemic alternatives can represent a same phoneme for different dialectal classes. An interchangeable phonemic alternative can include a string of phonemes.

Journal ArticleDOI
TL;DR: An improved complete composition vector method, under the assumption of a uniform and independent model, is proposed to estimate the sequence information contributing to selection for sequence comparison; it is more robust than existing counterparts and comparable in robustness with alignment-based methods.
Abstract: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison–one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences. We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods. We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.
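
A bare-bones composition-vector comparison: k-string frequency vectors compared by cosine distance. The paper's CCV additionally subtracts the expected frequencies under its uniform-and-independent background model, which is omitted here:

    from collections import Counter
    from math import sqrt

    def comp_vector(seq, k=2):
        c = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
        total = sum(c.values())
        return {g: v / total for g, v in c.items()}  # normalized frequencies

    def cosine_distance(u, v):
        dot = sum(u[g] * v.get(g, 0) for g in u)
        nu = sqrt(sum(x * x for x in u.values()))
        nv = sqrt(sum(x * x for x in v.values()))
        return 1 - dot / (nu * nv)

    print(cosine_distance(comp_vector("ACGTACGTGG"), comp_vector("ACGTTCGTAA")))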

01 Jan 2008
TL;DR: Abstract syntax trees closely mirror the abstract grammar (type term = Var of string | Fn of string * term list), making it possible to express in a direct recursive way the function returning the set of variables in a term.
Abstract: Abstract syntax trees almost copy the abstract grammar:

    type term = Var of string | Fn of string * term list;;

and express in a direct recursive way the function returning the set of variables in a term:

    let rec vars tm =
      (* assumed completion of the truncated source: collect the
         variables of all subterms, without duplicates *)
      match tm with
        Var x -> [x]
      | Fn (_, args) -> List.sort_uniq compare (List.concat (List.map vars args));;