
Showing papers on "String (computer science) published in 2004"


Journal ArticleDOI
TL;DR: STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface.
Abstract: A full description of a protein's function requires knowledge of all partner proteins with which it specifically associates. From a functional perspective, 'association' can mean direct physical binding, but can also mean indirect interaction such as participation in the same metabolic pathway or cellular process. Currently, information about protein association is scattered over a wide variety of resources and model organisms. STRING aims to simplify access to this information by providing a comprehensive, yet quality-controlled collection of protein-protein associations for a large number of organisms. The associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis. STRING integrates and ranks these associations by benchmarking them against a common reference set, and presents evidence in a consistent and intuitive web interface. Importantly, the associations are extended beyond the organism in which they were originally described, by automatic transfer to orthologous protein pairs in other organisms, where applicable. STRING currently holds 730,000 proteins in 180 fully sequenced organisms, and is available at http://string.embl.de/.

1,446 citations


Journal ArticleDOI
TL;DR: A class of string kernels, called mismatch kernels, is introduced for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection, where it is shown that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available.
Abstract: Motivation: Classification of protein sequences into functional and structural families based on sequence homology is a central problem in computational biology. Discriminative supervised machine learning approaches provide good performance, but simplicity and computational efficiency of training and prediction are also important concerns. Results: We introduce a class of string kernels, called mismatch kernels, for use with support vector machines (SVMs) in a discriminative approach to the problem of protein classification and remote homology detection. These kernels measure sequence similarity based on shared occurrences of fixed-length patterns in the data, allowing for mutations between patterns. Thus, the kernels provide a biologically well-motivated way to compare protein sequences without relying on family-based generative models such as hidden Markov models. We compute the kernels efficiently using a mismatch tree data structure, allowing us to calculate the contributions of all patterns occurring in the data in one pass while traversing the tree. When used with an SVM, the kernels enable fast prediction on test sequences. We report experiments on two benchmark SCOP datasets, where we show that the mismatch kernel used with an SVM classifier performs competitively with state-of-the-art methods for homology detection, particularly when very few training examples are available. Examination of the highest-weighted patterns learned by the SVM classifier recovers biologically important motifs in protein families and superfamilies. Availability: SVM software is publicly available at http://microarray.cpmc.columbia.edu/gist. Mismatch kernel software is available upon request.

602 citations
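
As a concrete illustration, here is a brute-force sketch of a (k,m)-mismatch kernel: each sequence maps to counts over all k-mers within m mismatches of its observed k-mers, and the kernel is the inner product of the two profiles. The function names and small default alphabet are mine, and the brute force stands in for the paper's far more efficient mismatch-tree computation.

```python
from itertools import product

def mismatch_profile(seq, k, m, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Map a sequence to counts over all k-mers within m mismatches of
    each observed k-mer (brute force; the paper uses a mismatch tree)."""
    profile = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        for cand in product(alphabet, repeat=k):
            cand = "".join(cand)
            if sum(a != b for a, b in zip(kmer, cand)) <= m:
                profile[cand] = profile.get(cand, 0) + 1
    return profile

def mismatch_kernel(x, y, k=3, m=1):
    """Inner product of the two mismatch profiles."""
    px, py = mismatch_profile(x, k, m), mismatch_profile(y, k, m)
    return sum(c * py.get(kmer, 0) for kmer, c in px.items())

print(mismatch_kernel("ACDEFG", "ACDEFH"))  # positive: the sequences share similar k-mers
```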


Book ChapterDOI
Tero Harju1
01 Jan 2004
TL;DR: Words (strings of symbols) are fundamental in computer processing, and nearly all computer software uses algorithms on strings.
Abstract: Words (strings of symbols) are fundamental in computer processing. Indeed, each bit of data processed by a computer is a string, and nearly all computer software uses algorithms on strings. There is also an abundant supply of applications of these algorithms in other areas such as data compression, DNA sequence analysis, computer graphics, cryptography, and so on.

598 citations


Proceedings ArticleDOI
13 Jun 2004
TL;DR: This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly.
Abstract: This paper considers various flavors of the following online problem: preprocess a text or collection of strings, so that given a query string p, all matches of p with the text can be reported quickly. In this paper we consider matches in which a bounded number of mismatches are allowed, or in which a bounded number of "don't care" characters are allowed. The specific problems we look at are: indexing, in which there is a single text t, and we seek locations where p matches a substring of t; dictionary queries, in which a collection of strings is given upfront, and we seek those strings which match p in their entirety; and dictionary matching, in which a collection of strings is given upfront, and we seek those substrings of a (long) p which match an original string in its entirety. These are all instances of an all-to-all matching problem, for which we provide a single solution. The performance bounds all have a similar character. For example, for the indexing problem with n = |t| and m = |p|, the query time for k substitutions is O(m + (c1 log n)^k/k! + # matches), with a data structure of size O(n (c2 log n)^k/k!) and a preprocessing time of O(n (c2 log n)^k/k!), where c1, c2 > 1 are constants. The deterministic preprocessing assumes a weakly nonuniform RAM model; this assumption is not needed if randomization is used in the preprocessing.

301 citations
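
For orientation, the query being indexed has a simple brute-force baseline (naming mine): report every position where p matches a substring of t with at most k substitutions, in O(nm) time per query, which is the work the paper's index reduces to roughly O(m + (c1 log n)^k/k! + # matches).

```python
def k_mismatch_positions(t, p, k):
    """Brute-force baseline: all i where p matches t[i:i+len(p)]
    with at most k substitutions. O(len(t) * len(p)) per query."""
    m = len(p)
    return [i for i in range(len(t) - m + 1)
            if sum(a != b for a, b in zip(t[i:i + m], p)) <= k]

print(k_mismatch_positions("abracadabra", "abba", 2))  # [0, 7]
```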


Patent
Scott D. Sanders1
05 Feb 2004
TL;DR: In this article, a text search string can be normalized into searchable terms and the terms interpreted as either text search terms or attribute search terms and results are given a relevancy ranking according to the interpretation.
Abstract: Subject matter includes a search engine for electronic program guide (EPG) data and related methods. In an exemplary method, a text search string can be normalized into searchable terms and the terms interpreted as either text search terms or attribute search terms. One or more queries having search conditions of varying degrees of complexity are created according to the interpretation of the terms of the search string. One or more searches in EPG databases and/or web-resources are performed based on interpretation of the text and attribute terms and results are given a relevancy ranking according to the interpretation. The combined search results may be grouped, ranked, and filtered for display to the user. Results may also be displayed progressively as each character of a search string is entered by the user.

270 citations
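
The normalize-then-interpret step lends itself to a small sketch. Everything concrete below (the GENRES and CHANNELS vocabularies, the tokenizer, the classification rule) is invented for illustration; the patent does not fix a particular scheme.

```python
import re

# Hypothetical attribute vocabularies; the patent does not specify these.
GENRES = {"comedy", "drama", "news", "sports"}
CHANNELS = {"hbo", "cnn", "espn"}

def normalize(search_string):
    """Normalize a text search string into lowercase searchable terms."""
    return re.findall(r"[a-z0-9]+", search_string.lower())

def interpret(terms):
    """Split terms into attribute search terms and free-text search terms."""
    attrs, text = [], []
    for t in terms:
        if t in GENRES:
            attrs.append(("genre", t))
        elif t in CHANNELS:
            attrs.append(("channel", t))
        else:
            text.append(t)
    return attrs, text

print(interpret(normalize("HBO comedy tonight")))
# -> ([('channel', 'hbo'), ('genre', 'comedy')], ['tonight'])
```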


Journal ArticleDOI
TL;DR: Algorithmic Clustering of Music Based on String Compression and its Applications.
Abstract: Rudi Cilibrasi, Paul Vitanyi, and Ronald de Wolf, "Algorithmic Clustering of Music Based on String Compression," Computer Music Journal, 28:4, pp. 49–67, Winter 2004. © 2004 Massachusetts Institute of Technology. Affiliations: Centrum voor Wiskunde en Informatica, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands; Institute for Logic, Language, and Computation, University of Amsterdam, Plantage Muidergracht 24, 1018 TV Amsterdam, The Netherlands. Contact: {Rudi.Cilibrasi, Paul.Vitanyi, Ronald.de.Wolf}@cwi.nl.

247 citations


Patent
24 Nov 2004
TL;DR: In this article, a language processing system including a rules database and a meaning engine is presented, which is able to store a plurality of syntax rules and generate referent tridbits corresponding to stimuli in the information input string based on syntax rules.
Abstract: A method for processing natural language includes receiving an information input string. Referent tridbits corresponding to stimuli in the information input string are generated. Assert tridbits defining relationships between the referent tridbits are generated. A language processing system includes a rules database and a meaning engine. The rules database is operable to store a plurality of syntax rules. The meaning engine is operable to receive an information input string, generate referent tridbits corresponding to stimuli in the information input string based on the syntax rules, and generate assert tridbits defining relationships between the referent tridbits based on the syntax rules.

217 citations


Patent
Mong Suan Yee1
02 Sep 2004
TL;DR: The sphere decoder as discussed by the authors is a data structure configured to define, for each level of the search, a set of symbol values from which the postulated values are selected, the sets of symbols values being different at different levels of search.
Abstract: This invention is generally concerned with methods, apparatus and processor control code for decoding signals, in particular by means of sphere decoding. A sphere decoder is configured to search for one or more strings of symbols lying less than a search bound from an input signal by establishing a value for each symbol of a candidate string in turn: values are postulated for each symbol of the candidate string in turn, and it is determined whether a postulated symbol value results in a distance metric, dependent upon the search bound, being satisfied. Each symbol of a candidate string for which values are postulated defines a level of the search. The sphere decoder includes a data structure configured to define, for each level of the search, a set of symbol values from which the postulated values are selected, the sets of symbol values being different at different levels of the search.

210 citations


PatentDOI
Kai-Fu Lee1, Zheng Chen1, Jian Han1
TL;DR: In this article, a language input architecture has a search engine, one or more typing models, a language model, and lexicons for different languages, which converts input strings of phonetic text to an output string of language text.
Abstract: A language input architecture converts input strings of phonetic text to an output string of language text. The language input architecture has a search engine, one or more typing models, a language model, and one or more lexicons for different languages. The typing model is configured to generate a list of probable typing candidates that may be substituted for the input string based on probabilities of how likely each of the candidate strings was incorrectly entered as the input string. The language model provides probable conversion strings for each of the typing candidates based on probabilities of how likely a probable conversion output string represents the candidate string. The search engine combines the probabilities of the typing and language models to find the most probable conversion string that represents a converted form of the input string.

188 citations
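
The architecture is essentially a noisy-channel decoder, and a toy sketch makes the probability combination concrete: score each (typing candidate, conversion) pair by the product of the typing-model and language-model probabilities and keep the best. The probability tables and example strings below are invented; the real models are statistical and trained from data.

```python
# Invented toy probability tables; the real models are trained from corpora.
TYPING_MODEL = {           # P(candidate caused the observed typed input)
    "ni hao": {"ni hao": 0.9, "ni hap": 0.1},
    "ni hoa": {"ni hao": 0.7, "ni hoa": 0.3},
}
LANGUAGE_MODEL = {         # P(language text | phonetic candidate)
    "ni hao": {"你好": 0.95, "泥好": 0.05},
}

def convert(typed):
    """Pick the conversion string maximizing typing * language probability."""
    best, best_p = None, 0.0
    for candidate, p_typo in TYPING_MODEL.get(typed, {typed: 1.0}).items():
        for output, p_lang in LANGUAGE_MODEL.get(candidate, {}).items():
            if p_typo * p_lang > best_p:
                best, best_p = output, p_typo * p_lang
    return best, best_p

print(convert("ni hoa"))  # -> ('你好', 0.665): the typo is corrected and converted
```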


Proceedings ArticleDOI
22 Feb 2004
TL;DR: A novel linear-array string matching architecture using a buffered, two-comparator variation on the Knuth-Morris-Pratt (KMP) algorithm, proving the bound on the buffer size and running time, and providing performance comparisons against other approaches.
Abstract: Pattern matching for network security and intrusion detection demands exceptionally high performance. Much work has been done in this field, and yet there is still significant room for improvement in efficiency, flexibility, and throughput. We develop a novel linear-array string matching architecture using a buffered, two-comparator variation on the Knuth-Morris-Pratt (KMP) algorithm. For small (16 or fewer characters) patterns, it competes favorably with the state-of-the-art while providing better scalability and reconfiguration, and more efficient hardware utilization. The area efficiency compared to other approaches improves further still as the pattern size increases, because only the tables increase in size. KMP is a well-known, efficient string matching technique using a single comparator and a precomputed transition table. We add a second comparator and an input buffer, allowing the system to accept at least one character in each cycle and terminate after at most a number of clock cycles equal to the length of the input string plus the size of the buffer. The system also provides a clean, modular route to reconfiguring the patterns on-the-fly and scaling the system to support more units, using several rows of linear array elements. In this paper, we prove the bound on the buffer size and running time, and provide performance comparisons against other approaches.

182 citations
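
For reference, the KMP core that the hardware builds on is a precomputed failure table driving a single-comparator scan; below is a standard software sketch of that baseline (the paper's second comparator and input buffer are hardware features not reproduced here).

```python
def kmp_table(pattern):
    """Precompute the KMP failure (prefix) table for a pattern."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, pattern):
    """Yield start indices of pattern in text in O(len(text)) time."""
    table, k = kmp_table(pattern), 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = table[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            yield i - k + 1
            k = table[k - 1]

print(list(kmp_search("abxabcabcaby", "abcaby")))  # [6]
```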


Proceedings ArticleDOI
23 May 2004
TL;DR: A sound, static program analysis technique to verify the correctness of dynamically generated query strings is presented, along with the details of a prototype tool based on the algorithm and several illustrative defects found in senior software-engineering student-team projects, online tutorial examples, and a real-world purchase order system.
Abstract: Many data-intensive applications dynamically construct queries in response to client requests and execute them. Java servlets, e.g., can create string representations of SQL queries and then send the queries, using JDBC, to a database server for execution. The servlet programmer enjoys static checking via Java's strong type system. However, the Java type system does little to check for possible errors in the dynamically generated SQL query strings. Thus, a type error in a generated selection query (e.g., comparing a string attribute with an integer) can result in an SQL runtime exception. Currently, such defects must be rooted out through careful testing, or (worse) might be found by customers at runtime. In this paper, we present a sound, static, program analysis technique to verify the correctness of dynamically generated query strings. We describe our analysis technique and provide soundness results for our static analysis algorithm. We also describe the details of a prototype tool based on the algorithm and present several illustrative defects found in senior software-engineering student-team projects, online tutorial examples, and a real-world purchase order system written by one of the authors.

Journal ArticleDOI
TL;DR: An O(|S|)-time algorithm that operates on the suffix tree T(S) for a string S, finding and marking the endpoint of every tandem repeat that occurs in S, improves and generalizes several prior efforts to efficiently capture large subsets of tandem repeats.
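
For contrast with the paper's O(|S|) suffix-tree algorithm, tandem repeats (occurrences of a square ww) can be enumerated naively in cubic time; a small sketch with names of my choosing:

```python
def tandem_repeats(s):
    """Naive O(n^3) enumeration of tandem repeats (substrings of form ww);
    the paper achieves O(|S|) using the suffix tree."""
    reps = set()
    n = len(s)
    for i in range(n):
        for l in range(1, (n - i) // 2 + 1):
            if s[i:i + l] == s[i + l:i + 2 * l]:
                reps.add((i, s[i:i + 2 * l]))
    return reps

print(sorted(tandem_repeats("abaabaab")))  # includes (2, 'aa') and (0, 'abaaba')
```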

Journal ArticleDOI
TL;DR: This algorithm is “lightweight” in the sense that it uses very small space in addition to the space required by the suffix array itself, and is fast even when the input contains many repetitions: this has been shown by extensive experiments with inputs of size up to 110 Mb.
Abstract: In this paper we describe a new algorithm for building the suffix array of a string. This task is equivalent to the problem of lexicographically sorting all the suffixes of the input string. Our algorithm is based on a new approach called deep–shallow sorting: we use a “shallow” sorter for the suffixes with a short common prefix, and a “deep” sorter for the suffixes with a long common prefix. All the known algorithms for building the suffix array either require a large amount of space or are inefficient when the input string contains many repeated substrings. Our algorithm has been designed to overcome this dichotomy. Our algorithm is “lightweight” in the sense that it uses very small space in addition to the space required by the suffix array itself. At the same time our algorithm is fast even when the input contains many repetitions: this has been shown by extensive experiments with inputs of size up to 110 Mb. The source code of our algorithm, as well as a C library providing a simple API, is available under the GNU GPL.
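
The object being constructed is easy to state naively: sort the suffix indices by their suffixes. The sketch below does exactly that, and illustrates the dichotomy the paper attacks, since it is neither space-light nor fast on repetitive inputs.

```python
def suffix_array(s):
    """Naive suffix array: sort all suffix indices lexicographically.
    Simple but slow on repetitive inputs; the paper's deep-shallow
    sorter avoids both large space and that slowdown."""
    return sorted(range(len(s)), key=lambda i: s[i:])

s = "banana"
sa = suffix_array(s)
print(sa)                   # [5, 3, 1, 0, 4, 2]
print([s[i:] for i in sa])  # ['a', 'ana', 'anana', 'banana', 'na', 'nana']
```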

Proceedings ArticleDOI
21 Jul 2004
TL;DR: In this paper, the authors explore generalizations of ordinary parsing algorithms that allow the input to consist of string tuples and/or the grammar to range over strings. But their work is limited to syntactic parsing.
Abstract: In an ordinary syntactic parser, the input is a string, and the grammar ranges over strings. This paper explores generalizations of ordinary parsing algorithms that allow the input to consist of string tuples and/or the grammar to range over string tuples. Such algorithms can infer the synchronous structures hidden in parallel texts. It turns out that these generalized parsers can do most of the work required to train and apply a syntax-aware statistical machine translation system.

Patent
20 Apr 2004
TL;DR: In this paper, a processor for processing contents of packets passing through a connection point on a computer network is presented, which includes a searching apparatus having one or more comparators for searching for a reference string in the contents of a packet.
Abstract: A processor for processing contents of packets passing through a connection point on a computer network. The processor includes a searching apparatus having one or more comparators for searching for a reference string in the contents of a packet, and processes contents of all packets passing through the connection point in real time. In one implementation, the processor is programmable and has an instruction set that includes an instruction for invoking the searching apparatus to search for a specified reference string in the packet starting at an unknown location within a range of the packet.

Proceedings ArticleDOI
11 Jan 2004
TL;DR: A study of deletion rates for which one can successfully reconstruct the original string using a small number of samples is initiated; a simple reconstruction algorithm called Bitwise Majority Alignment is shown to have an interesting self-correcting property whereby local distortions in the traces do not generate errors in the reconstruction and eventually get corrected.
Abstract: We are given a collection of m random subsequences (traces) of a string t of length n, where each trace is obtained by deleting each bit in the string with probability q. Our goal is to exactly reconstruct the string t from these observed traces. We initiate here a study of deletion rates for which we can successfully reconstruct the original string using a small number of samples. We investigate a simple reconstruction algorithm called Bitwise Majority Alignment that uses majority voting (with suitable shifts) to determine each bit of the original string. We show that for random strings t, we can reconstruct the original string (w.h.p.) for q = O(1/log n) using only O(log n) samples. For arbitrary strings t, we show that a simple modification of Bitwise Majority Alignment reconstructs a string that has identical structure to the original string (w.h.p.) for q = O(1/n^(1/2+ε)) using O(1) samples. In this case, using O(n log n) samples, we can reconstruct the original string exactly. Our setting can be viewed as the study of an idealized biological evolutionary process where the only possible mutations are random deletions. Our goal is to understand at what mutation rates a small number of observed samples can be correctly aligned to reconstruct the parent string. In the process of establishing these results, we show that Bitwise Majority Alignment has an interesting self-correcting property whereby local distortions in the traces do not generate errors in the reconstruction and eventually get corrected.
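
A stripped-down sketch of Bitwise Majority Alignment (my simplification, without the paper's shift handling and structural-reconstruction variant): keep one pointer per trace, emit the majority of the currently pointed-at bits, and advance only the pointers that agree, on the presumption that a disagreeing trace lost that bit to a deletion.

```python
import random

def bitwise_majority_alignment(traces, n):
    """Simplified BMA for a deletion channel: at each of n positions,
    take the majority of each trace's current bit; agreeing traces
    advance, disagreeing traces are presumed to have a deletion here."""
    ptrs = [0] * len(traces)
    out = []
    for _ in range(n):
        bits = [t[p] for t, p in zip(traces, ptrs) if p < len(t)]
        if not bits:
            break
        maj = 1 if sum(bits) * 2 >= len(bits) else 0
        out.append(maj)
        for j, t in enumerate(traces):
            if ptrs[j] < len(t) and t[ptrs[j]] == maj:
                ptrs[j] += 1
    return out

random.seed(0)
t = [random.randint(0, 1) for _ in range(64)]
q = 0.05  # deletion probability per bit
traces = [[b for b in t if random.random() > q] for _ in range(20)]
print(bitwise_majority_alignment(traces, len(t)) == t)  # usually True at this low q
```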

Journal ArticleDOI
TL;DR: This work defines a word to be a meaningful string composed of several Chinese characters, considers the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and uses them as the measurement of the context independency of a string from the rest of the sentences in the document.
Abstract: We are interested in the problem of word extraction from Chinese text collections. We define a word to be a meaningful string composed of several Chinese characters. For example, 'percent' and 'more and more' are not recognized as traditional Chinese words from the viewpoint of some people. However, in our work, they are words because they are very widely used and have specific meanings. We start with the viewpoint that a word is a distinguished linguistic entity that can be used in many different language environments. We consider the characters that are directly before a string (predecessors) and the characters that are directly after a string (successors) as important factors for determining the independence of the string. We call such characters accessors of the string, consider the number of distinct predecessors and successors of a string in a large corpus (TREC 5 and TREC 6 documents), and use them as the measurement of the context independency of a string from the rest of the sentences in the document. Our experiments confirm our hypothesis and show that this simple rule gives quite good results for Chinese word extraction and is comparable to, and for long words outperforms, other iterative methods.
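
A minimal sketch of the accessor-variety idea: for a candidate string, count distinct characters occurring immediately before and after each occurrence in an unsegmented corpus. The toy Latin-alphabet corpus and function name are mine; the paper measures this over Chinese text from the TREC 5 and TREC 6 collections.

```python
def accessor_variety(corpus, candidate):
    """Count distinct predecessors and successors of a candidate string;
    high counts on both sides suggest context independence."""
    preds, succs = set(), set()
    for sentence in corpus:
        start = sentence.find(candidate)
        while start != -1:
            if start > 0:
                preds.add(sentence[start - 1])
            end = start + len(candidate)
            if end < len(sentence):
                succs.add(sentence[end])
            start = sentence.find(candidate, start + 1)
    return len(preds), len(succs)

corpus = ["thecatsat", "acatran", "mycatnaps"]  # unsegmented, like Chinese text
print(accessor_variety(corpus, "cat"))  # (3, 3): varied contexts on both sides
```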

Book ChapterDOI
19 Feb 2004
TL;DR: A constant round protocol for Oblivious Transfer in Maurer's bounded storage model that has only 5 messages and uses constructions of almost t-wise independent permutations, randomness extractors and averaging samplers from the theory of derandomization.
Abstract: We present a constant round protocol for Oblivious Transfer in Maurer's bounded storage model. In this model, a long random string R is initially transmitted and each of the parties interacts based on a small portion of R. Even though the portions stored by the honest parties are small, security is guaranteed against any malicious party that remembers almost all of the string R. Previous constructions for Oblivious Transfer in the bounded storage model required polynomially many rounds of interaction. Our protocol has only 5 messages. We also improve other parameters, such as the number of bits transferred and the probability of prematurely aborting the protocol due to failure. Our techniques utilize explicit constructions from the theory of derandomization. In particular, we use constructions of almost t-wise independent permutations, randomness extractors and averaging samplers.

Proceedings ArticleDOI
06 May 2004
TL;DR: This work presents a general algorithm for the indexation of weighted automata and introduces a general framework based on weighted transducers that generalizes this indexation to enable the search for more complex patterns including syntactic information or for different types of sequences, e.g., word sequences instead of phonemic sequences.
Abstract: Much of the massive quantity of digitized data widely available (e.g., text, speech, hand-written sequences) is given as weighted automata, either directly or as a result of some prior processing. These are compact representations of a large number of alternative sequences and their weights, reflecting the uncertainty or variability of the data. Thus, the indexation of such data requires indexing weighted automata. We present a general algorithm for the indexation of weighted automata. The resulting index is represented by a deterministic weighted transducer that is optimal for search: the search for an input string takes time linear in the sum of the size of that string and the number of indices of the weighted automata where it appears. We also introduce a general framework based on weighted transducers that generalizes this indexation to enable the search for more complex patterns, including syntactic information, or for different types of sequences, e.g., word sequences instead of phonemic sequences. The use of this framework is illustrated with several examples. We applied our general indexation algorithm and framework to the problem of indexation of speech utterances and report the results of our experiments in several tasks, demonstrating that our techniques yield comparable results to previous methods, while providing greater generality, including the possibility of searching for arbitrary patterns represented by weighted automata.

Book ChapterDOI
31 Aug 2004
TL;DR: This paper reports on the techniques and experience in dealing with flexible string matching against real AT&T databases, and identifies various performance enhancements to speed up the matching process.
Abstract: Data Cleaning is an important process that has been at the center of research interest in recent years. Poor data quality is the result of a variety of reasons, including data entry errors and multiple conventions for recording database fields, and has a significant impact on a variety of business issues. Hence, there is a pressing need for technologies that enable flexible (fuzzy) matching of string information in a database. Cosine similarity with tf-idf is a well-established metric for comparing text, and recent proposals have adapted this similarity measure for flexibly matching a query string with values in a single attribute of a relation. In deploying tf-idf based flexible string matching against real AT&T databases, we observed that this technique needed to be enhanced in many ways. First, along the functionality dimension, where there was a need to flexibly match along multiple string-valued attributes, and also take advantage of known semantic equivalences. Second, we identified various performance enhancements to speed up the matching process, potentially trading off a small degree of accuracy for substantial performance gains. In this paper, we report on our techniques and experience in dealing with flexible string matching against real AT&T databases.
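
The base metric is easy to sketch: tf-idf vectors over the tokens of a string column, compared by cosine similarity. The smoothed idf and the toy values below are my choices; the paper's contributions, multi-attribute matching, semantic equivalences, and the performance enhancements, sit on top of this measure.

```python
import math
from collections import Counter

def tfidf_vectors(strings):
    """tf-idf vectors over whitespace tokens (smoothed idf, my choice)."""
    docs = [s.lower().split() for s in strings]
    df = Counter(tok for d in docs for tok in set(d))
    n = len(docs)
    return [{t: c * (math.log((1 + n) / (1 + df[t])) + 1)
             for t, c in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity between two sparse tf-idf vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

values = ["AT&T Labs Research", "ATT Labs - Research", "IBM Research"]
vecs = tfidf_vectors(values)
for s, v in zip(values[1:], vecs[1:]):
    print(f"{s!r}: {cosine(vecs[0], v):.3f}")  # near-duplicate scores highest
```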

Patent
28 Jan 2004
TL;DR: In this paper, the authors consider the problem of sending personal data to a recipient using both a public data item provided by a trusted party and an encryption key string formed using at least policy data indicative of conditions to be satisfied before access is given to the personal data.
Abstract: When sending personal data to a recipient, the data owner encrypts the data using both a public data item provided by a trusted party and an encryption key string formed using at least policy data indicative of conditions to be satisfied before access is given to the personal data. The encryption key string is typically also provided to the recipient along with the encrypted personal data. To decrypt the personal data, the recipient sends the encryption key string to the trusted party with a request for the decryption key. The trusted party determines the required decryption key using the encryption key string and private data used in deriving its public data, and provides it to the requesting recipient. However, the decryption key is either not determined or not made available until the trusted party is satisfied that the associated policy conditions have been met by the recipient.

Patent
17 Dec 2004
TL;DR: In this article, a string replacement is performed in text using linguistic processing, which identifies the existence of direct or indirect links between the string to be replaced and other strings in the text.
Abstract: String replacement is performed in text using linguistic processing. The linguistic processing identifies the existence of direct or indirect links between the string to be replaced and other strings in the text. Morphological, syntactic, anaphoric, or semantic inconsistencies, which are introduced in strings with the identified direct or indirect links to the string that is to be replaced are detected and corrected.

Patent
01 Apr 2004
TL;DR: In this article, a user views access strings for available data services, selects the access string for the desired data service, and returns the selected access string, which is associated with one or more profiles, each profile includes various parameters needed to establish a specific data call, and each profile is further associated with an activation string that contains connection information for the data call.
Abstract: Techniques for performing system selection based on a usage model that uses “access strings”, “profiles”, and “activation strings” are described. Access strings are defined for wireless data services and provide a highly intuitive user interface. Each access string is associated with one or more profiles. Each profile includes various parameters needed to establish a specific data call. Each profile is further associated with an activation string that contains connection information for the data call. System selection is performed in two parts. In the first part, a wireless user views access strings for available data services, selects the access string for the desired data service, and returns the selected access string. In the second part, the wireless device selects a profile for a system most suited to provide the desired data service, from among all profiles associated with the selected access string.

Book ChapterDOI
20 Dec 2004
TL;DR: The minimum common string partition problem (MCSP) is addressed, a string comparison problem with tight connection to the problem of sorting by reversals with duplicates, a key problem in genome rearrangement; 2-MCSP (and therefore MCSP) is shown to be NP-hard and even APX-hard.
Abstract: String comparison is a fundamental problem in computer science, with applications in areas such as computational biology, text processing or compression. In this paper we address the minimum common string partition problem, a string comparison problem with tight connection to the problem of sorting by reversals with duplicates, a key problem in genome rearrangement. A partition of a string A is a sequence $\mathcal{P}=(P_1,P_2,\ldots,P_m)$ of strings, called the blocks, whose concatenation is equal to A. Given a partition $\mathcal{P}$ of a string A and a partition $\mathcal{Q}$ of a string B, we say that the pair $\langle\mathcal{P},\mathcal{Q}\rangle$ is a common partition of A and B if $\mathcal{Q}$ is a permutation of $\mathcal{P}$. The minimum common string partition problem (MCSP) is to find a common partition of two strings A and B with the minimum number of blocks. The restricted version of MCSP where each letter occurs at most k times in each input string is denoted by k-MCSP. In this paper, we show that 2-MCSP (and therefore MCSP) is NP-hard and, moreover, even APX-hard. We describe a 1.1037-approximation for 2-MCSP and a linear time 4-approximation algorithm for 3-MCSP. We are not aware of any better approximations.
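
A tiny exhaustive sketch pins down the definition: enumerate partitions of A in increasing block count and return the first whose multiset of blocks also partitions B. This is exponential and only for toy inputs; the function names are mine.

```python
from collections import Counter
from itertools import combinations

def partitions(s):
    """All partitions of s into blocks, in increasing block count."""
    n = len(s)
    for r in range(n):
        for cuts in combinations(range(1, n), r):
            pts = (0,) + cuts + (n,)
            yield [s[i:j] for i, j in zip(pts, pts[1:])]

def mcsp(a, b):
    """Exhaustive minimum common string partition (toy inputs only)."""
    for p in partitions(a):
        key = Counter(p)
        if any(len(q) == len(p) and Counter(q) == key for q in partitions(b)):
            return p
    return None

print(mcsp("ababc", "abcab"))  # ['ab', 'abc']: two blocks suffice
```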

Proceedings ArticleDOI
26 Oct 2004
TL;DR: A system for online recognition of handwritten Tamil characters is presented, and a structure- or shape-based representation of strokes is used in which a stroke is represented as a string of shape features.
Abstract: A system for online recognition of handwritten Tamil characters is presented. A handwritten character is constructed by executing a sequence of strokes. A structure- or shape-based representation of a stroke is used in which a stroke is represented as a string of shape features. Using this string representation, an unknown stroke is identified by comparing it with a database of strokes using a flexible string matching procedure. A full character is recognized by identifying all the component strokes. Character termination is determined using a finite state automaton. Development of similar systems for other Indian scripts is outlined.
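
The flexible string matching step can be approximated by plain edit distance over a shape-feature alphabet: compare the unknown stroke's feature string against each database string and pick the nearest. The three-letter feature alphabet and stroke names below are hypothetical, and the paper's matcher is more elaborate than this sketch.

```python
def edit_distance(a, b):
    """Levenshtein distance between two feature strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Hypothetical feature alphabet: L = line, C = clockwise curve, A = anticlockwise curve
database = {"stroke_1": "LCA", "stroke_2": "CCA", "stroke_3": "LLL"}
unknown = "LCC"
best = min(database, key=lambda k: edit_distance(unknown, database[k]))
print(best, edit_distance(unknown, database[best]))  # stroke_1 1
```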

Patent
29 Mar 2004
TL;DR: In this article, the authors describe a specific application of block cipher cryptography, where the digital content is encrypted using an encryption key and a calculated initialization vector, and the initialization vector is calculated by performing an exclusive disjunction function on a seed value and the string of data for each stride.
Abstract: Protection of digital content using a specific application of block cipher cryptography is described. The digital content is encrypted using an encryption key and a calculated initialization vector. The digital content includes a plurality of strides of data and each stride includes a string of data to be encrypted and a block of data to be encrypted. The calculated initialization vector to be used to encrypt the block of data is derived from the string of data in the stride to be encrypted. Furthermore, the initialization vector is calculated by performing an exclusive disjunction function on a seed value and the string of data for each stride.
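
A minimal sketch of the described derivation, an exclusive disjunction (XOR) of a seed value with the stride's string of data: the folding of the string to the cipher block size and the seed construction are my assumptions, not the patent's exact byte layout.

```python
import hashlib

def derive_iv(seed: bytes, string_part: bytes, block_size: int = 16) -> bytes:
    """Derive a per-stride IV by XORing a seed with the stride's
    string-of-data, folded to the cipher block size (folding is an
    assumption; the patent specifies its own format)."""
    folded = bytearray(block_size)
    for i, b in enumerate(string_part):
        folded[i % block_size] ^= b
    return bytes(s ^ f for s, f in zip(seed, folded))

seed = hashlib.sha256(b"per-title seed").digest()[:16]  # hypothetical seed choice
iv = derive_iv(seed, b"stride 0 string of data")
print(iv.hex())
```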

Book ChapterDOI
19 Feb 2004
TL;DR: The use of random oracles to achieve universal composability of commitment protocols is motivated, and two constructions are given which make it possible to turn a given non-interactive commitment scheme into a non-interactive universally composable commitment scheme in the random oracle model.
Abstract: In the setting of universal composability [Can01], commitments cannot be implemented without additional assumptions such as that of a publicly available common reference string [CF01]. Here, as an alternative to the commitments in the common reference string model, the use of random oracles to achieve universal composability of commitment protocols is motivated. Special emphasis is put on the security in the situation when the additional "helper functionality" is replaced by a realizable primitive. This contribution gives two constructions which make it possible to turn a given non-interactive commitment scheme into a non-interactive universally composable commitment scheme in the random oracle model. For both constructions the binding and the hiding property remain valid when collision-free hash functions are used instead of random oracles. Moreover, the second construction in this case even preserves the property of perfect binding.

Book ChapterDOI
31 Aug 2004
TL;DR: This paper presents a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature.
Abstract: Large string datasets are common in a number of emerging text and biological database applications. Common queries over such datasets include both exact and approximate string matches. These queries can be evaluated very efficiently by using a suffix tree index on the string dataset. Although suffix trees can be constructed quickly in memory for small input datasets, constructing persistent trees for large datasets has been challenging. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n^2) complexity outperforms the popular O(n) Ukkonen algorithm, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we present a buffer management strategy for the O(n^2) algorithm, creating a new disk-based construction algorithm that scales to sizes much larger than have been previously described in the literature. Our approach far outperforms the best known disk-based construction algorithms.
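
The O(n^2) flavor is easy to see in a naive sketch that inserts every suffix character by character. Note this builds an uncompressed suffix trie rather than a true suffix tree with path-compressed edges, so it is a stand-in for the idea, not the paper's cache-conscious algorithm.

```python
class Node:
    __slots__ = ("children", "indices")
    def __init__(self):
        self.children = {}
        self.indices = []

def build_suffix_trie(s):
    """Naive O(n^2) construction: insert every suffix of s, recording at
    each node which suffixes pass through it."""
    root = Node()
    for i in range(len(s)):
        node = root
        for c in s[i:] + "$":   # "$" terminates each suffix
            node = node.children.setdefault(c, Node())
            node.indices.append(i)
    return root

def find(root, pattern):
    """Return start positions of all occurrences of pattern."""
    node = root
    for c in pattern:
        if c not in node.children:
            return []
        node = node.children[c]
    return node.indices

root = build_suffix_trie("banana")
print(find(root, "ana"))  # [1, 3]
```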

Patent
10 Dec 2004
TL;DR: In this paper, a system and methods for validating the authenticity of a signature on a document by providing a document from an account, the document including an actual signature and a machinereadable identifier, wherein the machine-readable identifier contains a string of data representing the integral characteristics of all valid account signatures and a person-specific confidence threshold.
Abstract: Systems and methods are provided for validating the authenticity of a signature on a document by providing a document from an account, the document including an actual signature and a machine-readable identifier, wherein the machine-readable identifier contains a string of data representing the integral characteristics of all valid account signatures and a person-specific confidence threshold. When the document is presented at a point of presentment, the document is scanned into a document-processing machine and the actual signature is compared against all valid account signatures.

Patent
03 Dec 2004
TL;DR: In this paper, the present invention relates to novel methods for generating variant proteins with increased host string content, and proteins that are engineered using these methods, and is related to our work.
Abstract: The present invention relates to novel methods for generating variant proteins with increased host string content, and proteins that are engineered using these methods.