Showing papers on "String (computer science) published in 2008"

PDF

Open Access

Journal Article•DOI•

[...]

31 Mar 2008-Journal of Statistical Software

TL;DR: The tm package is presented which provides a framework for text mining applications within R and techniques for count-based analysis methods, text clustering, text classification and string kernels are presented.

...read moreread less

Abstract: During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

...read moreread less

1,057 citations

Book Chapter•DOI•

Founding Cryptography on Oblivious Transfer --- Efficiently

[...]

Yuval Ishai¹, Manoj Prabhakaran², Amit Sahai¹•Institutions (2)

University of California, Los Angeles¹, University of Illinois at Urbana–Champaign²

17 Aug 2008

TL;DR: A simple and efficient compiler is presented for transforming secure multi-party computation protocols that enjoy security only with an honest majority into MPC protocols that guarantee security with no honest majority, in the oblivious-transfer (OT) hybrid model.

...read moreread less

Abstract: We present a simple and efficient compiler for transforming secure multi-party computation (MPC) protocols that enjoy security only with an honest majority into MPC protocols that guarantee security with no honest majority, in the oblivious-transfer (OT) hybrid model. Our technique works by combining a secure protocol in the honest majority setting with a protocol achieving only security against semi-honestparties in the setting of no honest majority. Applying our compiler to variants of protocols from the literature, we get several applications for secure two-party computation and for MPC with no honest majority. These include: Constant-rate two-party computation in the OT-hybrid model. We obtain a statistically UC-secure two-party protocol in the OT-hybrid model that can evaluate a general circuit Cof size sand depth dwith a total communication complexity of O(s) + poly(k, d, log s) and O(d) rounds. The above result generalizes to a constant number of parties. Extending OTs in the malicious model. We obtain a computationally efficient protocol for generating many string OTs from few string OTs with only a constant amortized communication overheadcompared to the total length of the string OTs. Black-box constructions for constant-round MPC with no honest majority. We obtain general computationally UC-secure MPC protocols in the OT-hybrid model that use only a constant number of rounds, and only make a black-boxaccess to a pseudorandom generator. This gives the first constant-round protocols for three or more parties that only make a black-box use of cryptographic primitives (and avoid expensive zero-knowledge proofs).

...read moreread less

635 citations

Proceedings Article•

ParsCit: an Open-source CRF Reference String Parsing Package

[...]

Isaac G. Councill¹, C. Lee Giles², Min-Yen Kan³•Institutions (3)

Pennsylvania State University¹, Penn State College of Information Sciences and Technology², National University of Singapore³

01 May 2008

TL;DR: Parsing package ParsCit is described, a freely available, open-source implementation of a reference string parsing package that wraps a trained conditional random field model with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts.

...read moreread less

Abstract: We describe ParsCit, a freely available, open-source implementation of a reference string parsing package. At the core of ParsCit is a trained conditional random field (CRF) model used to label the token sequences in the reference string. A heuristic model wraps this core with added functionality to identify reference strings from a plain text file, and to retrieve the citation contexts. The package comes with utilities to run it as a web service or as a standalone utility. We compare ParsCit on three distinct reference string datasets and show that it compares well with other previously published work.

...read moreread less

339 citations

Journal Article•DOI•

Finding Transition Pathways Using the String Method with Swarms of Trajectories

[...]

Albert C. Pan¹, Deniz Sezer², Benoît Roux²•Institutions (2)

University of Chicago¹, Argonne National Laboratory²

22 Feb 2008-Journal of Physical Chemistry B

TL;DR: In this article, an approach to find transition pathways in complex systems is presented, which consists of refining a putative transition path in the multidimensional space supported by a set of collective variables using the average dynamic drift of those variables.

...read moreread less

Abstract: An approach to find transition pathways in complex systems is presented. The method, which is related to the string method in collective variables of Maragliano et al. (J. Chem. Phys. 2006, 125, 024106), is conceptually simple and straightforward to implement. It consists of refining a putative transition path in the multidimensional space supported by a set of collective variables using the average dynamic drift of those variables. This drift is estimated on-the-fly via swarms of short unbiased trajectories started at different points along the path. Successive iterations of this algorithm, which can be naturally distributed over many computer nodes with negligible interprocessor communication, refine an initial trial path toward the most probable transition path (MPTP) between two stable basins. The method is first tested by determining the pathway for the C7eq to C7ax transition in an all-atom model of the alanine dipeptide in vacuum, which has been studied previously with the string method in collecti...

...read moreread less

306 citations

Proceedings Article•DOI•

Efficient Merging and Filtering Algorithms for Approximate String Searches

[...]

Chen Li¹, Jiaheng Lu¹, Yiming Lu¹•Institutions (1)

University of California, Irvine¹

07 Apr 2008

TL;DR: This paper develops several algorithms that can greatly improve the performance of existing algorithms and studies how to integrate existing filtering techniques with these algorithms, and shows that they should be used together judiciously.

...read moreread less

Abstract: We study the following problem: how to efficiently find in a collection of strings those similar to a given query string? Various similarity functions can be used, such as edit distance, Jaccard similarity, and cosine similarity. This problem is of great interests to a variety of applications that need a high real-time performance, such as data cleaning, query relaxation, and spellchecking. Several algorithms have been proposed based on the idea of merging inverted lists of grams generated from the strings. In this paper we make two contributions. First, we develop several algorithms that can greatly improve the performance of existing algorithms. Second, we study how to integrate existing filtering techniques with these algorithms, and show that they should be used together judiciously, since the way to do the integration can greatly affect the performance. We have conducted experiments on several real data sets to evaluate the proposed techniques.

...read moreread less

296 citations

Patent•

Automated test input generation for web applications

[...]

Gary Wassermann¹, Dachuan Yu¹, Ajay Chandler¹, Dinakar Dhurjati¹, Hiroshi Inamura¹ - Show less +1 more•Institutions (1)

NTT DoCoMo¹

03 Nov 2008

TL;DR: In this paper, a method and apparatus for automated test input generation for web applications is described, which comprises performing a source-to-source transformation of the program, performing interpretation on the program based on a set of test input values; symbolically executing the program; recording a symbolic constraint for each of one or more conditional expressions encountered during execution, including analyzing a string operation in the program to identify possible execution paths, and generating symbolic inputs representing values of variables in each of the conditional expressions as a numeric expression and a string constraint including generating constraints on string values.

...read moreread less

Abstract: A method and apparatus is disclosed herein for automated test input generation for web applications. In one embodiment, the method comprises performing a source-to-source transformation of the program; performing interpretation on the program based on a set of test input values; symbolically executing the program; recording a symbolic constraint for each of one or more conditional expressions encountered during execution of the program, including analyzing a string operation in the program to identify one or more possible execution paths, and generating symbolic inputs representing values of variables in each of the conditional expressions as a numeric expression and a string constraint including generating constraints on string values by modeling string operations using finite state transducers (FSTs) and supplying values from the program's execution in place of intractable sub-expressions; and generating new inputs to drive the program during a subsequent iteration based on results of solving the recorded string constraints.

...read moreread less

231 citations

Proceedings Article•DOI•

Dynamic test input generation for web applications

[...]

Gary Wassermann¹, Dachuan Yu², Ajay Chander², Dinakar Dhurjati², Hiroshi Inamura², Zhendong Su¹ - Show less +2 more•Institutions (2)

University of California, Davis¹, NTT DoCoMo²

20 Jul 2008

TL;DR: An automated input test generation algorithm that uses runtime values to analyze dynamic code, models the semantics of string operations, and handles operations whose argument and return values may not share a common type is proposed.

...read moreread less

Abstract: Web applications routinely handle sensitive data, and many people rely on them to support various daily activities, so errors can have severe and broad-reaching consequences. Unlike most desktop applications, many web applications are written in scripting languages, such as PHP. The dynamic features commonly supported by these languages significantly inhibit static analysis and existing static analysis of these languages can fail to produce meaningful results on realworld web applications.Automated test input generation using the concolic testing framework has proven useful for finding bugs and improving test coverage on C and Java programs, which generally emphasize numeric values and pointer-based data structures. However, scripting languages, such as PHP, promote a style of programming for developing web applications that emphasizes string values, objects, and arrays.In this paper, we propose an automated input test generation algorithm that uses runtime values to analyze dynamic code, models the semantics of string operations, and handles operations whose argument and return values may not share a common type. As in the standard concolic testing framework, our algorithm gathers constraints during symbolic execution. Our algorithm resolves constraints over multiple types by considering each variable instance individually, so that it only needs to invert each operation. By recording constraints selectively, our implementation successfully finds bugs in real-world web applications which state-of-the-art static analysis tools fail to analyze.

...read moreread less

203 citations

Patent•

Language input interface on a device

[...]

Yasuo Kida¹, Ken Kocienda¹, Elizabeth Caroline Furches¹•Institutions (1)

Apple Inc.¹

04 Mar 2008

TL;DR: In this paper, a virtual keyboard is presented in a first region of a touch sensitive display of a device and an input representing a phonetic string is received on the virtual keyboard.

...read moreread less

Abstract: Methods, systems, and apparatus, including computer program products, for inputting text. A virtual keyboard is presented in a first region of a touch sensitive display of a device. An input representing a phonetic string is received on the virtual keyboard. The entered phonetic string is presented in a second region of the touch sensitive display. One or more candidates are identified based on the phonetic string. At least a subset of the candidates is presented. An input selecting one of the candidates is received. The entered phonetic string is replaced with the selected candidate.

...read moreread less

195 citations

Proceedings Article•DOI•

Boomerang: resourceful lenses for string data

[...]

Aaron Bohannon¹, J. Nathan Foster¹, Benjamin C. Pierce¹, Alexandre Pilkiewicz², Alan Schmitt³ - Show less +1 more•Institutions (3)

University of Pennsylvania¹, École Polytechnique², French Institute for Research in Computer Science and Automation³

07 Jan 2008

TL;DR: The essential property of resourcefulness is formalized-the correct use of keys to associate chunks in the input and output-by defining a refined semantic space of quasi-oblivious lenses, which several previously studied properties of lenses turn out to have compact characterizations in this space.

...read moreread less

Abstract: A lens is a bidirectional program. When read from left toright, it denotes an ordinary function that maps inputs to outputs. When read from right to left, it denotes an ''update translator'' that takes an input together with an updated output and produces a new input that reflects the update. Many variants of this idea have been explored in the literature, but none deal fully with ordered data. If, for example, an update changes the order of a list in theoutput, the items in the output list and the chunks of the input that generated them can be misaligned, leading to lost or corrupted data.We attack this problem in the context of bidirectional transformations over strings, the primordial ordered data type. We first propose a collection of bidirectional string lens combinators, based on familiar operations on regular transducers (union, concatenation, Kleene-star) and with a type system based on regular expressions. We then design anew semantic space of dictionary lenses, enriching the lenses of Foster et al. (2007) with support for two additional combinators for marking ''reorderable chunks'' andtheir keys. To demonstrate the effectiveness of these primitives, we describe the design and implementation of Boomerang, a full-blown bidirectional programming language with dictionary lenses at its core. We have used Boomerang to build transformers for complex real-world data format sincluding the SwissProt genomic database.We formalize the essential property of resourcefulness-the correct use of keys to associate chunks in the input and output-by defining a refined semantic space of quasi-oblivious lenses. Several previously studied properties of lenses turn out to have compact characterizations in this space.

...read moreread less

195 citations

Journal Article•DOI•

Training tree transducers

[...]

Jonathan Graehl¹, Kevin Knight¹, Jonathan May¹•Institutions (1)

University of Southern California¹

01 Sep 2008-Computational Linguistics

TL;DR: This work motivates the use of tree transducers for natural language and addresses the training problem for probabilistic tree- to-tree and tree-to-string transducers.

...read moreread less

Abstract: Many probabilistic models for natural language are now written in terms of hierarchical tree structure. Tree-based modeling still lacks many of the standard tools taken for granted in (finite-state) string-based modeling. The theory of tree transducer automata provides a possible framework to draw on, as it has been worked out in an extensive literature. We motivate the use of tree transducers for natural language and address the training problem for probabilistic tree-to-tree and tree-to-string transducers.

...read moreread less

176 citations

Journal Article•DOI•

Probabilistic multi-class multi-kernel learning

[...]

Theodoros Damoulas¹, Mark Girolami¹•Institutions (1)

University of Glasgow¹

01 May 2008-Bioinformatics

TL;DR: A single multi-class kernel machine that informatively combines the available feature groups and is able to provide the state-of-the-art in performance accuracy on the fold recognition problem is demonstrated.

...read moreread less

Abstract: Motivation: The problems of protein fold recognition and remote homology detection have recently attracted a great deal of interest as they represent challenging multi-feature multi-class problems for which modern pattern recognition methods achieve only modest levels of performance. As with many pattern recognition problems, there are multiple feature spaces or groups of attributes available, such as global characteristics like the amino-acid composition (C), predicted secondary structure (S), hydrophobicity (H), van der Waals volume (V), polarity (P), polarizability (Z), as well as attributes derived from local sequence alignment such as the Smith–Waterman scores. This raises the need for a classification method that is able to assess the contribution of these potentially heterogeneous object descriptors while utilizing such information to improve predictive performance. To that end, we offer a single multi-class kernel machine that informatively combines the available feature groups and, as is demonstrated in this article, is able to provide the state-of-the-art in performance accuracy on the fold recognition problem. Furthermore, the proposed approach provides some insight by assessing the significance of recently introduced protein features and string kernels. The proposed method is well-founded within a Bayesian hierarchical framework and a variational Bayes approximation is derived which allows for efficient CPU processing times. Results: The best performance which we report on the SCOP PDB-40D benchmark data-set is a 70% accuracy by combining all the available feature groups from global protein characteristics but also including sequence-alignment features. We offer an 8% improvement on the best reported performance that combines multi-class k-nn classifiers while at the same time reducing computational costs and assessing the predictive power of the various available features. Furthermore, we examine the performance of our methodology on the SCOP 1.53 benchmark data-set that simulates remote homology detection and examine the combination of various state-of-the-art string kernels that have recently been proposed. Contact: theo@dcs.gla.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.

...read moreread less

Patent•

Fast signature scan

[...]

Qiang Wang

20 Oct 2008

TL;DR: In this article, the authors proposed a method for signature scanning in string fields, which includes processing one or more signatures into one or many formats that include fingerprints and data structures for each fixed-size signature or signature substring.

...read moreread less

Abstract: Systems and methods for scanning signatures in a string field. In one implementation, the invention provides a method for signature scanning. The method includes processing one or more signatures into one or more formats that include one or more fingerprints and one or more follow-on search data structures for each fixed-size signature or signature substring such that the number of fingerprints for each fixed-size signature or signature substring is equal to a step size for a signature scanning operation and the particular fixed-size signature or signature substring is identifiable at any location within any string fields to be scanned, receiving a particular string field, identifying any signatures included in the particular string field including scanning for the fingerprints for each scan step size and searching for the follow-on search data structures at the locations where one or more fingerprints are found, and outputting any identified signatures.

...read moreread less

Patent•DOI•

Voice query extension method and system

[...]

Jeong Mi Cho¹, Byung-kwan Kwak¹, Namhoon Kim¹, Ick Sang Han¹•Institutions (1)

Samsung¹

10 Mar 2008-Journal of the Acoustical Society of America

TL;DR: In this article, the authors proposed a voice query extension method that detects voice activity of a user from an input signal and extracting a feature vector from the voice activity, converting the feature vector into at least one phoneme sequence and generating the at least phoneme sequences; matching the matching words with words registered in a dictionary, extracting a string of matched words with a linguistic meaning, and selecting the string of the matched words as a query; determining whether the query is in a predetermined first language, and when the query was not in the first language as a result of the determining,

...read moreread less

Abstract: A voice query extension method and system. The voice query extension method includes: detecting voice activity of a user from an input signal and extracting a feature vector from the voice activity; converting the feature vector into at least one phoneme sequence and generating the at least one phoneme sequence; matching the at least one phoneme sequence with words registered in a dictionary, extracting a string of the matched words with a linguistic meaning, and selecting the string of the matched words as a query; determining whether the query is in a predetermined first language, and when the query is not in the first language as a result of the determining, converting the query using a phoneme to grapheme rule, and generating a query in the first language; and searching using the query in the first language.

...read moreread less

Proceedings Article•DOI•

Transformation-based Framework for Record Matching

[...]

Arvind Arasu¹, Surajit Chaudhuri¹, Raghav Kaushik¹•Institutions (1)

Microsoft¹

07 Apr 2008

TL;DR: A programmatic framework of record matching that takes such user-defined string transformations as input, and is the first proposal for such a framework to be proposed.

...read moreread less

Abstract: Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We propose a programmatic framework of record matching that takes such user-defined string transformations as input. To the best of our knowledge, this is the first proposal for such a framework. This transformational framework, while expressive, poses significant computational challenges which we address. We empirically evaluate our techniques over real data.

...read moreread less

Journal Article•DOI•

Computing Longest Previous Factor in linear time and applications

[...]

Maxime Crochemore¹, Lucian Ilie²•Institutions (2)

King's College London¹, University of Western Ontario²

01 Apr 2008-Information Processing Letters

TL;DR: Two optimal linear-time algorithms for computing the Longest Previous Factor (LPF) array corresponding to a string w are given and several properties and applications of LPF are investigated.

...read moreread less

Journal Article•

Linear-Time Computation of Similarity Measures for Sequential Data

[...]

Konrad Rieck¹, Pavel Laskov¹•Institutions (1)

Fraunhofer Society¹

01 Jun 2008-Journal of Machine Learning Research

TL;DR: This article proposes a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions, and provides linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures.

...read moreread less

Abstract: Efficient and expressive comparison of sequences is an essential procedure for learning with sequential data. In this article we propose a generic framework for computation of similarity measures for sequences, covering various kernel, distance and non-metric similarity functions. The basis for comparison is embedding of sequences using a formal language, such as a set of natural words, k-grams or all contiguous subsequences. As realizations of the framework we provide linear-time algorithms of different complexity and capabilities using sorted arrays, tries and suffix trees as underlying data structures. Experiments on data sets from bioinformatics, text processing and computer security illustrate the efficiency of the proposed algorithms---enabling peak performances of up to 106 pairwise comparisons per second. The utility of distances and non-metric similarity measures for sequences as alternatives to string kernels is demonstrated in applications of text categorization, network intrusion detection and transcription site recognition in DNA.

...read moreread less

Proceedings Article•

Trace reconstruction with constant deletion probability and related results

[...]

Thomas Holenstein¹, Michael Mitzenmacher², Rina Panigrahy¹, Udi Wieder¹•Institutions (2)

Microsoft¹, Harvard University²

20 Jan 2008

TL;DR: Algorithms that require a number of traces exponential in Õ (√n) for any Δ < 1 even for worst case strings, and lower bound results for simpler classes of algorithms based on summary statistics from the traces are derived.

...read moreread less

Abstract: We provide several new results for the trace reconstruction problem. In this setting, a binary string yields a collection of traces, where each trace is independently obtained by independently deleting each bit with a fixed probability Δ. Each trace therefore consists of a random subsequence of the original sequence. Given the traces, we wish to reconstruct the original string with high probability. The questions are how many traces are necessary for reconstruction, and how efficiently can the reconstruction be performed. Our primary result is that for some universal constant γ and uniformly chosen strings of length n, for any Δ n) traces in poly(n) time with high probability. We also obtain algorithms that require a number of traces exponential in O (√n) for any Δ

...read moreread less

Journal Article•DOI•

Using String Matching for Deep Packet Inspection

[...]

Po-Ching Lin¹, Ying-Dar Lin¹, Tsern-Huei Lee¹, Yuan-Cheng Lai¹•Institutions (1)

National Chiao Tung University¹

01 Apr 2008-IEEE Computer

TL;DR: String matching has sparked renewed research interest due to its usefulness for deep packet inspection in applications such as intrusion detection, virus scanning, and Internet content filtering.

...read moreread less

Abstract: String matching has sparked renewed research interest due to its usefulness for deep packet inspection in applications such as intrusion detection, virus scanning, and Internet content filtering. Matching expressive pattern specifications with a scalable and efficient design, accelerating the entire packet flow, and string matching with high-level semantics are promising topics for further study.

...read moreread less

Journal Article•DOI•

Automata presenting structures: A survey of the finite string case

[...]

Sasha Rubin¹•Institutions (1)

University of Auckland¹

01 Jun 2008-The Bulletin of Symbolic Logic

TL;DR: The problems surveyed here include the classification of classes of structures with automatic presentations, the complexity of the isomorphism problem, and the relationship between definability and recognisability.

...read moreread less

Abstract: A structure has a (finite-string) automatic presentation if the elements of its domain can be named by finite strings in such a way that the coded domain and the coded atomic operations are recognised by synchronous multitape automata. Consequently, every structure with an automatic presentation has a decidable first-order theory. The problems surveyed here include the classification of classes of structures with automatic presentations, the complexity of the isomorphism problem, and the relationship between definability and recognisability.

...read moreread less

Proceedings Article•DOI•

Cost-based variable-length-gram selection for string collections to support approximate queries efficiently

[...]

Xiaochun Yang¹, Bin Wang¹, Chen Li²•Institutions (2)

Northeastern University (China)¹, University of California, Irvine²

09 Jun 2008

TL;DR: This study proposes a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance and proposes an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries.

...read moreread less

Abstract: Approximate queries on a collection of strings are important in many applications such as record linkage, spell checking, and Web search, where inconsistencies and errors exist in data as well as queries. Several existing algorithms use the concept of "grams," which are substrings of strings used as signatures for the strings to build index structures. A recently proposed technique, called VGRAM, improves the performance of these algorithms by using a carefully chosen dictionary of variable-length grams based on their requencies in the string collection. Since an index structure using fixed-length grams can be viewed as a special case of VGRAM, a fundamental problem arises naturally: what is the relationship between the gram dictionary and the performance of queries? We study this problem in this paper. We propose a dynamic programming algorithm for computing a tight lower bound on the number of common grams shared by two similar strings in order to improve query performance. We analyze how a gram dictionary affects the index structure of the string collection and ultimately the performance of queries. We also propose an algorithm for automatically computing a dictionary of high-quality grams for a workload of queries. Our experiments on real data sets show the improvement on query performance achieved by these techniques. To our best knowledge, this study is the first cost-based quantitative approach to deciding good grams for approximate string queries.

...read moreread less

Proceedings Article•DOI•

Plagiarism Detection Using the Levenshtein Distance and Smith-Waterman Algorithm

[...]

Zhan Su¹, Byung-Ryul Ahn¹, Ki-Yol Eom¹, Min-Koo Kang¹, Jin-Pyung Kim¹, Moon-Kyun Kim¹ - Show less +2 more•Institutions (1)

Sungkyunkwan University¹

18 Jun 2008

TL;DR: This paper investigates the use of a diagonal line, which is derived from Levenshtein distance, and simplified Smith-Waterman algorithm that is a classical tool in the identification and quantification of local similarities in biological sequences, with a view to the application in the plagiarism detection.

...read moreread less

Abstract: Plagiarism in texts is issues of increasing concern to the academic community. Now most common text plagiarism occurs by making a variety of minor alterations that include the insertion, deletion, or substitution of words. Such simple changes, however, require excessive string comparisons. In this paper, we present a hybrid plagiarism detection method. We investigate the use of a diagonal line, which is derived from Levenshtein distance, and simplified Smith-Waterman algorithm that is a classical tool in the identification and quantification of local similarities in biological sequences, with a view to the application in the plagiarism detection. Our approach avoids globally involved string comparisons and considers psychological factors, which can yield significant speed-up by experiment results. Based on the results, we indicate the practicality of such improvement using Levenshtein distance and Smith-Waterman algorithm and to illustrate the efficiency gains. In the future, it would be interesting to explore appropriate heuristics in the area of text comparison.

...read moreread less

Book•DOI•

R Programming for Bioinformatics

[...]

Robert Gentleman

14 Jul 2008

TL;DR: The author examines different facets of string handling and manipulations, discusses the interfacing of R with other languages, and describes how to write software packages, and concludes with a discussion on the debugging and profiling of R code.

...read moreread less

Abstract: From the co-developer of R and lead founder of the Bioconductor Project Thanks to its data handling and modeling capabilities and its flexibility, R is becoming the most widely used software in bioinformatics. R Programming for Bioinformatics builds the programming skills needed to use R for solving bioinformatics and computational biology problems. Drawing on the authors experiences as an R expert, the book begins with coverage on the general properties of the R language, several unique programming aspects of R, and object-oriented programming in R. It presents methods for data input and output as well as database interactions. The author also examines different facets of string handling and manipulations, discusses the interfacing of R with other languages, and describes how to write software packages. He concludes with a discussion on the debugging and profiling of R code.

...read moreread less

Book Chapter•DOI•

Symbolic String Verification: An Automata-Based Approach

[...]

Fang Yu¹, Tevfik Bultan¹, Marco Cova¹, Oscar H. Ibarra¹•Institutions (1)

University of California, Santa Barbara¹

10 Aug 2008

TL;DR: This work presents an automata-based approach for the verification of string operations in PHP programs based on symbolic string analysis, and proposes a novel algorithm for language-based replacement that works quite well in checking the correctness of sanitization operations in real-world PHP applications.

...read moreread less

Abstract: We present an automata-based approach for the verification of string operations in PHP programs based on symbolic string analysis. String analysis is a static analysis technique that determines the values that a string expression can take during program execution at a given program point. This information can be used to verify that string values are sanitized properly and to detect programming errors and security vulnerabilities. In our string analysis approach, we encode the set of string values that string variables can take as automata. We implement all string functions using a symbolic automata representation (MBDD representation from the MONA automata package) and leverage efficient manipulations on MBDDs, e.g., determinization and minimization. Particularly, we propose a novel algorithm for language-based replacement. Our replacement function takes three DFAs as arguments and outputs a DFA. Finally, we apply a widening operator defined on automata to approximate fixpoint computations. If this conservative approximation does not include any badpatterns (specified as regular expressions), we conclude that the program does not contain any errors or vulnerabilities. Our experimental results demonstrate that our approach works quite well in checking the correctness of sanitization operations in real-world PHP applications.

...read moreread less

Patent•

Method and system for accessing textual widgets

[...]

Daniel J. McCloskey¹, Alena Kucharenka, Pavel Volkov, Carol S. Zimmet•Institutions (1)

IBM¹

16 Oct 2008

TL;DR: In this article, a method and system for accessing textual widgets is described, which includes: entering a string expression into a document, invoking a spell-checker to check a spelling of the string expression, marking it as misspelled, identifying a textual widget based on the misspelling of the text expression, evaluating the misspelled string expression using the identified textual widget, displaying the at least one result of the evaluation, selecting a result of evaluation, and replacing the string expressions in the document with the selected result of an evaluation.

...read moreread less

Abstract: The disclosure is directed to a method and system for accessing textual widgets. A method in accordance with an embodiment includes: entering a string expression into a document; invoking a spell-checker to check a spelling of the string expression; marking the string expression as misspelled; identifying a textual widget based on the misspelling of the string expression; evaluating the misspelled string expression using the identified textual widget, the identified textual widget returning at least one result of the evaluation; displaying the at least one result of the evaluation; selecting a result of the evaluation; and replacing the string expression in the document with the selected result of the evaluation.

...read moreread less

Journal Article•DOI•

SYBYL line notation (SLN): a single notation to represent chemical structures, queries, reactions, and virtual libraries.

[...]

R. Webster Homer, Jon Swanson, Robert J. Jilek, Tad Hurst, Robert D. Clark - Show less +1 more

11 Nov 2008-Journal of Chemical Information and Modeling

TL;DR: SYBYL line notation is a powerful way to represent molecular structures, reactions, libraries of structures, molecular fragments, formulations, molecular queries, and reaction queries comparable to SMARTS.

...read moreread less

Abstract: SYBYL line notation (SLN) is a powerful way to represent molecular structures, reactions, libraries of structures, molecular fragments, formulations, molecular queries, and reaction queries. Nearly any chemical structure imaginable, including macromolecules, pharmaceuticals, catalysts, and even combinatorial libraries can be represented as an SLN string. The language provides a rich syntax for database queries comparable to SMARTS. It provides full Markush, R-Group, reaction, and macro atom capabilities in a single unified notation. It includes the ability to specify 3D conformations and 2D depictions. All the information necessary to recreate the structure in a modeling or drawing package is present in a single, concise string of ASCII characters. This makes SLN ideal for structure communication over global computer networks between applications sitting at remote sites. Unlike SMILES and its derivatives, SLN accomplishes this within a single unified syntax. Structures, queries, compounds, reactions, and ...

...read moreread less

Proceedings Article•DOI•

Testing for buffer overflows with length abstraction

[...]

Ru-Gang Xu¹, Patrice Godefroid², Rupak Majumdar¹•Institutions (2)

University of California, Los Angeles¹, Microsoft²

20 Jul 2008

TL;DR: Splat, a tool for automatically generating inputs that lead to memory safety violations in C programs, is presented, and it is experimentally demonstrated that the symbolic length abstraction is both scalable and sufficient to uncover many real buffer overflows in C Programs.

...read moreread less

Abstract: We present Splat, a tool for automatically generating inputs that lead to memory safety violations in C programs. Splat performs directed random testing of the code, guided by symbolic execution. However, instead of representing the entire contents of an input buffer symbolically, Splat tracks only a prefix of the buffer symbolically, and a symbolic length that may exceed the size of the symbolic prefix. The part of the buffer beyond the symbolic prefix is filled with concrete random inputs. The use of symbolic buffer lengths makes it possible to compactly summarize the behavior of standard buffer manipulation functions, such as string library functions, leading to a more scalable search for possible memory errors. While reasoning only about prefixes of buffer contents makes the search theoretically incomplete, we experimentally demonstrate that the symbolic length abstraction is both scalable and sufficient to uncover many real buffer overflows in C programs. In experiments on a set of benchmarks developed independently to evaluate buffer overflow checkers, Splat was able to detect buffer overflows quickly, sometimes several orders of magnitude faster than when symbolically representing entire buffers. Splat was also able to find two previously unknown buffer overflows in a heavily-tested storage system.

...read moreread less

Patent•

Associating an identifier with a content unit

[...]

Mark A. O'Connell¹, Michael Kilian¹•Institutions (1)

EMC Corporation¹

31 Mar 2008

TL;DR: In this paper, the authors propose associating a source string with a target content unit stored on a content addressable storage (CAS) system, which may be accomplished, in some embodiments, by storing on the CAS system an associative content unit that includes the source string in its binding part and includes the target content units in its non-binding part.

...read moreread less

Abstract: Embodiments of the invention relate to associating a source string with a target content unit stored on a content addressable storage (CAS) system. This may be accomplished, in some embodiments, by storing on the CAS system an associative content unit that includes the source string in its binding part and includes the target content unit in its non-binding part.

...read moreread less

Patent•DOI•

System and method for pronunciation modeling

[...]

Andrej Ljolje¹, Alistair Conkie¹, Ann K. Syrdal•Institutions (1)

AT&T¹

04 Dec 2008-Journal of the Acoustical Society of America

TL;DR: In this paper, the authors identify a generic model of speech composed of phonemes, identify a family of interchangeable phonemic alternatives for a phoneme in the generic model, and generate a pronunciation model which substitutes each family for each respective phoneme.

...read moreread less

Abstract: Systems, computer-implemented methods, and tangible computer-readable media for generating a pronunciation model. The method includes identifying a generic model of speech composed of phonemes, identifying a family of interchangeable phonemic alternatives for a phoneme in the generic model of speech, labeling the family of interchangeable phonemic alternatives as referring to the same phoneme, and generating a pronunciation model which substitutes each family for each respective phoneme. In one aspect, the generic model of speech is a vocal tract length normalized acoustic model. Interchangeable phonemic alternatives can represent a same phoneme for different dialectal classes. An interchangeable phonemic alternative can include a string of phonemes.

...read moreread less

Journal Article•DOI•

An improved string composition method for sequence comparison

[...]

Guoquing Lu¹, Shunpu Zhang², Xiang Fang²•Institutions (2)

University of Nebraska Omaha¹, University of Nebraska–Lincoln²

28 May 2008-BMC Bioinformatics

TL;DR: An improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison is proposed and is more robust compared with existing counterparts and comparable in robustness with alignment-based methods.

...read moreread less

Abstract: Historically, two categories of computational algorithms (alignment-based and alignment-free) have been applied to sequence comparison–one of the most fundamental issues in bioinformatics. Multiple sequence alignment, although dominantly used by biologists, possesses both fundamental as well as computational limitations. Consequently, alignment-free methods have been explored as important alternatives in estimating sequence similarity. Of the alignment-free methods, the string composition vector (CV) methods, which use the frequencies of nucleotide or amino acid strings to represent sequence information, show promising results in genome sequence comparison of prokaryotes. The existing CV-based methods, however, suffer certain statistical problems, thereby underestimating the amount of evolutionary information in genetic sequences. We show that the existing string composition based methods have two problems, one related to the Markov model assumption and the other associated with the denominator of the frequency normalization equation. We propose an improved complete composition vector method under the assumption of a uniform and independent model to estimate sequence information contributing to selection for sequence comparison. Phylogenetic analyses using both simulated and experimental data sets demonstrate that our new method is more robust compared with existing counterparts and comparable in robustness with alignment-based methods. We observed two problems existing in the currently used string composition methods and proposed a new robust method for the estimation of evolutionary information of genetic sequences. In addition, we discussed that it might not be necessary to use relatively long strings to build a complete composition vector (CCV), due to the overlapping nature of vector strings with a variable length. We suggested a practical approach for the choice of an optimal string length to construct the CCV.

...read moreread less

Formal Proof—Theory and Practice

[...]

John Harrison

01 Jan 2008

TL;DR: Syntax trees almost copying the abstract grammar: type term = Var of string | Fn of string * term list to express in a direct recursive way the function returning the set of variables in a term.

...read moreread less

Abstract: syntax trees almost copying the abstract grammar: type term = Var of string | Fn of string * term list;; and express in a direct recursive way the function returning the set of variables in a term: let rec vars tm =

...read moreread less

Collapse