Showing papers on "String (computer science)" published in 1996


Patent
Julian M. Kupiec
08 May 1996
TL;DR: In this paper, a computerized method for retrieving documents from a text corpus in response to a user-supplied natural language input string, e.g., a question, is presented.
Abstract: A computerized method for retrieving documents from a text corpus in response to a user-supplied natural language input string, e.g., a question. An input string is accepted and analyzed to detect phrases therein. A series of queries based on the detected phrases is automatically constructed through a sequence of successive broadening and narrowing operations designed to generate an optimal query or queries. The queries of the series are executed to retrieve documents, which are then ranked and made available for output to the user, a storage device, or further processing. In another aspect the method is implemented in the context of a larger two-phase method, of which the first phase comprises the method of the invention and the second phase of the method comprises answer extraction.

527 citations


Journal ArticleDOI
TL;DR: It has been shown that an EGA converges to the global optimal solution with any choice of initial population, and mutation operation has been found to be essential for convergence.
Abstract: In this article, the genetic algorithm with elitist model (EGA) is modeled as a finite state Markov chain. A state in the Markov chain denotes a population together with a potential string. Proof for the convergence of an EGA to the best chromosome (string), among all possible chromosomes, is provided here. Mutation operation has been found to be essential for convergence. It has been shown that an EGA converges to the global optimal solution with any choice of initial population.
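The elitist model described above can be illustrated with a minimal Python sketch. It is not the authors' formulation: the tournament selection, one-point crossover, parameter values, and the function name `elitist_ga` are all assumptions made for illustration. The only property it is meant to show is elitism, i.e. the best string found so far is carried into the next population unchanged, so the best fitness never decreases, while mutation keeps every string reachable.

```python
import random

def elitist_ga(fitness, length, pop_size=20, generations=50, p_mut=0.05, seed=0):
    """Toy elitist GA over bit strings (length >= 2).

    Returns (best_string, history), where history[g] is the best fitness
    after generation g. Selection and crossover choices are illustrative.
    """
    rng = random.Random(seed)
    pop = ["".join(rng.choice("01") for _ in range(length)) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = []
        while len(children) < pop_size - 1:
            # Tournament selection of two parents (an assumed operator).
            a = max(rng.sample(pop, 2), key=fitness)
            b = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: each bit flips independently with probability p_mut.
            child = "".join(
                ("1" if c == "0" else "0") if rng.random() < p_mut else c
                for c in child
            )
            children.append(child)
        # Elitism: the best string so far survives unchanged.
        pop = children + [best]
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Example: maximize the number of 1 bits in a 12-bit string.
best, history = elitist_ga(lambda s: s.count("1"), length=12)
```

Because the elite string is always a member of the next population, `history` is non-decreasing by construction; whether and how fast it reaches the optimum depends on the assumed operators and parameters.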

199 citations


Journal ArticleDOI
TL;DR: A lexicon-based, handwritten word recognition system combining segmentation-free and segmentation-based techniques is described that uses dynamic programming to match word images and strings.
Abstract: A lexicon-based, handwritten word recognition system combining segmentation-free and segmentation-based techniques is described. The segmentation-free technique constructs a continuous density hidden Markov model for each lexicon string. The segmentation-based technique uses dynamic programming to match word images and strings. The combination module uses differences in classifier capabilities to achieve significantly better performance.

193 citations


Journal Article
Kemal Oflazer
TL;DR: In this paper, error-tolerant recognition with finite-state recognizers has been used for morphological analysis of Turkish words and for spelling correction in English, Dutch, French, German, and Italian.
Abstract: This paper presents the notion of error-tolerant recognition with finite-state recognizers along with results from some applications. Error-tolerant recognition enables the recognition of strings that deviate mildly from any string in the regular set recognized by the underlying finite-state recognizer. Such recognition has applications to error-tolerant morphological processing, spelling correction, and approximate string matching in information retrieval. After a description of the concepts and algorithms involved, we give examples from two applications: in the context of morphological analysis, error-tolerant recognition allows misspelled input word forms to be corrected and morphologically analyzed concurrently. We present an application of this to error-tolerant analysis of the agglutinative morphology of Turkish words. The algorithm can be applied to morphological analysis of any language whose morphology has been fully captured by a single (and possibly very large) finite-state transducer, regardless of the word formation processes and morphographemic phenomena involved. In the context of spelling correction, error-tolerant recognition can be used to enumerate candidate correct forms from a given misspelled string within a certain edit distance. Error-tolerant recognition can be applied to spelling correction for any language, if (a) it has a word list comprising all inflected forms, or (b) its morphology has been fully described by a finite-state transducer. We present experimental results for spelling correction for a number of languages. 
These results indicate that such recognition works very efficiently for candidate generation in spelling correction for many European languages (English, Dutch, French, German, and Italian, among others) with very large word lists of root and inflected forms (some containing well over 200,000 forms), generating all candidate solutions within 10 to 45 milliseconds (with an edit distance of 1) on a SPARCStation 10/41. For spelling correction in Turkish, error-tolerant recognition operating with a (circular) recognizer of Turkish words (with about 29,000 states and 119,000 transitions) can generate all candidate words in less than 20 milliseconds, with an edit distance of 1.
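As a rough illustration of candidate generation within an edit-distance threshold, the Python sketch below walks a trie (standing in for the finite-state recognizer) depth-first and abandons any prefix that can no longer lead to a word within the threshold. It is a simplification under stated assumptions, not the paper's algorithm: it counts only insertions, deletions, and substitutions, it assumes non-empty words that do not contain the marker character "$", and the names `build_trie` and `candidates` are hypothetical.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def candidates(trie, query, max_dist=1):
    """Return {word: distance} for lexicon words within max_dist edits of query."""
    found = {}

    def walk(node, prefix, prev_row):
        for ch, child in node.items():
            if ch == "$":
                continue
            # One new row of the edit-distance table, for prefix + ch.
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                row.append(min(
                    row[j - 1] + 1,                          # extra character in query
                    prev_row[j] + 1,                         # character missing from query
                    prev_row[j - 1] + (query[j - 1] != ch),  # match or substitution
                ))
            word = prefix + ch
            if "$" in child and row[-1] <= max_dist:
                found[word] = row[-1]
            # Cut-off: if every entry exceeds the threshold, no extension can recover.
            if min(row) <= max_dist:
                walk(child, word, row)

    walk(trie, "", list(range(len(query) + 1)))
    return found

lexicon = build_trie(["string", "strong", "sting", "spring", "ring"])
print(sorted(candidates(lexicon, "strng", max_dist=1)))  # -> ['sting', 'string', 'strong']
```

The cut-off is what keeps the search small in practice: most branches of a large lexicon exceed the threshold after a few characters and are never explored further.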

183 citations


Journal ArticleDOI
TL;DR: It is proved that a simple algorithm can construct second-order recurrent neural networks with a sparse interconnection topology and sigmoidal discriminant function such that the internal DFA state representations are stable, that is, the constructed network correctly classifies strings of arbitrary length.
Abstract: Recurrent neural networks that are trained to behave like deterministic finite-state automata (DFAs) can show deteriorating performance when tested on long strings. This deteriorating performance can be attributed to the instability of the internal representation of the learned DFA states. The use of a sigmoidal discriminant function together with the recurrent structure contributes to this instability. We prove that a simple algorithm can construct second-order recurrent neural networks with a sparse interconnection topology and sigmoidal discriminant function such that the internal DFA state representations are stable, that is, the constructed network correctly classifies strings of arbitrary length. The algorithm is based on encoding strengths of weights directly into the neural network. We derive a relationship between the weight strength and the number of DFA states for robust string classification. For a DFA with n states and m input alphabet symbols, the constructive algorithm generates a “programmed” neural network with O(n) neurons and O(mn) weights. We compare our algorithm to other methods proposed in the literature.

178 citations


Journal Article
TL;DR: In this article, an efficient approach for real-time synthesis of plucked string instruments using physical modeling and DSP techniques is presented, and results of model-based resynthesis are illustrated to demonstrate that high-quality synthetic sounds of several string instruments can be generated using the proposed modeling principles.
Abstract: An efficient approach for real-time synthesis of plucked string instruments using physical modeling and DSP techniques is presented. Results of model-based resynthesis are illustrated to demonstrate that high-quality synthetic sounds of several string instruments can be generated using the proposed modeling principles. Real-time implementation using a signal processor is described, and several aspects of controlling physical models of plucked string instruments are studied.

176 citations


Journal ArticleDOI
TL;DR: This work shows how to simulate BPP and approximation algorithms in polynomial time using the output from a δ-source, and gives an application to the unapproximability of MAX CLIQUE.
Abstract: We show how to simulate BPP and approximation algorithms in polynomial time using the output from a δ-source. A δ-source is a weak random source that is asked only once for R bits, and must output an R-bit string according to some distribution that places probability no more than 2^(-δR) on any particular string. We also give an application to the unapproximability of MAX CLIQUE.

171 citations


Journal ArticleDOI
Mehryar Mohri
TL;DR: New applications of the theory of automata to natural language processing: the representation of very large scale dictionaries and the indexation of natural language texts are described based on new algorithms that are introduced and described in detail.
Abstract: We describe new applications of the theory of automata to natural language processing: the representation of very large scale dictionaries and the indexation of natural language texts. They are based on new algorithms that we introduce and describe in detail. In particular, we give pseudocodes for the determinisation of string to string transducers, the deterministic union of p-subsequential string to string transducers, and the indexation by automata. We report on several experiments illustrating the applications.

167 citations


Patent
02 Oct 1996
TL;DR: In this article, a speech recognition system capable of recognizing a word or a plurality of words based on a continuous spelling of the word(s) by a user is presented, which includes a decoder running in forward mode such that the recognition engine continuously outputs an updated string of hypothesized letters based on the letters uttered by the user.
Abstract: A speech recognition system capable of recognizing a word or a plurality of words based on a continuous spelling of the word(s) by a user. The system includes a speech recognition engine with a decoder running in forward mode such that the recognition engine continuously outputs an updated string of hypothesized letters based on the letters uttered by the user. The system further includes a spelling engine for comparing each string of hypothesized letters to a vocabulary list of words. The spelling engine returns a best match for the string of hypothesized letters. The system may also include an early identification unit for presenting the user with the best matching word(s) possibly before the user has completed spelling the desired word(s).

153 citations


Patent
Eric M. Visser
21 Jun 1996
TL;DR: A character string correction system corrects a spelling error in a character string input through the keyboard, OCR, etc. as discussed by the authors, where an error pattern representing frequent occurrences of errors is preliminarily set and stored in the memory.
Abstract: A character string correction system corrects a spelling error in a character string input through the keyboard, OCR, etc. An error pattern representing frequent occurrences of errors is preliminarily set and stored in the memory, etc. A processor reads an input character string character by character, and compares the read character with the error pattern. If the input character string matches an error pattern, it is assumed that an error exists. The input character is replaced with one of the alternative characters. Using the input character string or the character string corrected with an alternative character, a dictionary (TRIE table) is searched. If a corresponding word is detected in the dictionary, the word is output as one of the recognition results.

147 citations


Proceedings ArticleDOI
05 Aug 1996
TL;DR: A new algorithm is proposed that simultaneously identifies the coding system and language of a code string fetched from the Internet, especially World-Wide Web.
Abstract: This paper proposes a new algorithm that simultaneously identifies the coding system and language of a code string fetched from the Internet, especially World-Wide Web. The algorithm uses statistic language models to select the correctly decoded string as well as to determine the language. The proposed algorithm covers 9 languages and 11 coding systems used in Eastern Asia and Western Europe. Experimental results show that the level of accuracy of our algorithm is over 95% for 640 on-line documents.

Proceedings ArticleDOI
Martin Kay
24 Jun 1996
TL;DR: Charts constitute a natural uniform architecture for parsing and generation provided string position is replaced by a notion more appropriate to logical forms and that measures are taken to curtail generation paths containing semantically incomplete phrases.
Abstract: Charts constitute a natural uniform architecture for parsing and generation provided string position is replaced by a notion more appropriate to logical forms and that measures are taken to curtail generation paths containing semantically incomplete phrases.

Patent
28 Mar 1996
TL;DR: An apparatus and method are disclosed for obtaining samples of pristine formation fluid, using a work string (6) designed for performing other downhole work such as drilling, workover operations, or re-entry operations.
Abstract: An apparatus and method are disclosed for obtaining samples of pristine formation fluid, using a work string (6) designed for performing other downhole work such as drilling, workover operations, or re-entry operations. An extendable element (24, 26, 45) extends against the formation wall to obtain the pristine fluid sample. While the test tool (16) is in a standby condition, the extendable element (24, 26, 45) is withdrawn within the work string, protected by other structure from damage during operation of the work string (6). The apparatus is used to sense downhole conditions while using a work string (6), and the measurements taken can be used to adjust working fluid properties without withdrawing the work string (6) from the bore hole (4). When the extendable element (24, 26, 45) is a packer (24, 26), the apparatus can be used to prevent a kick from reaching the surface, to adjust the density of the drilling fluid, and thereafter to continue using the work string.

Journal ArticleDOI
TL;DR: An algorithm for finding probably correct alignments on the basis of phonetic similarity, consisting of an evaluation metric and a guided search procedure, that can be extended to implement special handling of metathesis, assimilation, or other phenomena that require looking ahead in the string.
Abstract: The first step in applying the comparative method to a pair of words suspected of being cognate is to align the segments of each word that appear to correspond. Finding the right alignment may require searching. For example, Latin do 'I give' lines up with the middle do in Greek didomi, not the initial di. This paper presents an algorithm for finding probably correct alignments on the basis of phonetic similarity. The algorithm consists of an evaluation metric and a guided search procedure. The search algorithm can be extended to implement special handling of metathesis, assimilation, or other phenomena that require looking ahead in the string, and can return any number of alignments that meet some criterion of goodness, not just the one best. It can serve as a front end to computer implementations of the comparative method.

Proceedings ArticleDOI
14 Oct 1996
TL;DR: The authors show that this general method based on assigning labels to some of the substrings of a given string is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing.
Abstract: A key approach in string processing algorithmics has been the labeling paradigm, which is based on assigning labels to some of the substrings of a given string. If these labels are chosen consistently, they can enable fast comparisons of substrings. Until the first optimal parallel algorithm for suffix tree construction was given by the authors in 1994, the labeling paradigm was considered not to be competitive with other approaches. They show that this general method is also useful for several central problems in the area of string processing: approximate string matching, dynamic dictionary matching, and dynamic text indexing. The approximate string matching problem deals with finding all substrings of a text which match a pattern "approximately", i.e., with at most m differences. The differences can be in the form of inserted, deleted, or replaced characters. The text indexing problem deals with finding all occurrences of a pattern in a text, after the text is preprocessed. In the dynamic text indexing problem, updates to the text in the form of insertions and deletions of substrings are permitted. The dictionary matching problem deals with finding all occurrences of each pattern of a set of patterns in a text, after the pattern set is preprocessed. In the dynamic dictionary matching problem, insertions and deletions of patterns to the pattern set are permitted.

Proceedings ArticleDOI
Richard Sproat, Michael Riley
24 Jun 1996
TL;DR: A method for compiling decision trees into weighted finite-state transducers using the weighted rewrite-rule rule-compilation algorithm described in (Mohri and Sproat, 1996).
Abstract: We report on a method for compiling decision trees into weighted finite-state transducers. The key assumptions are that the tree predictions specify how to rewrite symbols from an input string, and the decision at each tree node is stateable in terms of regular expressions on the input string. Each leaf node can then be treated as a separate rule where the left and right contexts are constructable from the decisions made traversing the tree from the root to the leaf. These rules are compiled into transducers using the weighted rewrite-rule rule-compilation algorithm described in (Mohri and Sproat, 1996).

Journal ArticleDOI
TL;DR: This work presents an algorithm for string matching with mismatches based on arithmetical operations that runs in linear worst case time for most practical cases and presents a new approach to string searching.

Journal ArticleDOI
TL;DR: Genetic algorithms are a general class of search methods that mimic natural gene-based optimization mechanisms that perform mutation, cross-over and replication operations on strings when applied to structure prediction.

Book ChapterDOI
03 Jul 1996
TL;DR: In this paper, it was shown that if the input texts are given by their Lempel-Ziv codes then the problems can be solved deterministically in polynomial time in the case when the original (uncompressed) texts are of exponential size.
Abstract: We consider several basic problems for texts and show that if the input texts are given by their Lempel-Ziv codes then the problems can be solved deterministically in polynomial time in the case when the original (uncompressed) texts are of exponential size. The growing importance of massively stored information requires new approaches to algorithms for compressed texts without decompressing. Denote by LZ(ω) the version of a string ω produced by Lempel-Ziv encoding algorithm. For given compressed strings LZ(T), LZ(P) we give the first known deterministic polynomial time algorithms to compute compressed representations of the set of all occurrences of the pattern P in T, all periods of T, all palindromes of T, and all squares of T. Then we consider several classical language recognition problems:

Patent
08 Nov 1996
TL;DR: In this article, a dynamically selectable language display system for object oriented database management systems is presented, in which a representation of a class object can be simultaneously displayed to a plurality of users in different languages based upon a language handle individually selectable by each of said plurality of users.
Abstract: A dynamically selectable language display system for object oriented database management systems is disclosed. Class objects are provided having international string parameters that include a pointer to an international string list, the international string list including a language handle structure linked to a plurality of character strings in different languages. A handle manager is provided which is operative to select a character string corresponding to one of said plurality of character strings for display which corresponds to a dynamically selectable user specified language handle, whereby a representation of said class object may be simultaneously displayed to a plurality of users in different languages based upon a language handle individually selectable by each of said plurality of users.

Patent
Hiyan Alshawi
10 Apr 1996
TL;DR: This article used a plurality of probabilistic finite state machines having the ability to recognize a pair of sequences, one sequence scanned leftwards, the other scanned rightwards, and incrementally calculate costs related to the probability that such phrases represent the language to be recognized.
Abstract: Methods and apparatus for a language model and language recognition systems are disclosed. The method utilizes a plurality of probabilistic finite state machines having the ability to recognize a pair of sequences, one sequence scanned leftwards, the other scanned rightwards. Each word in the lexicon of the language model is associated with one or more such machines which model the semantic relations between the word and other words. Machine transitions create phrases from a set of word string hypotheses, and incrementally calculate costs related to the probability that such phrases represent the language to be recognized. The cascading lexical head machines utilized in the methods and apparatus capture the structural associations implicit in the hierarchical organization of a sentence, resulting in a language model and language recognition systems that combine the lexical sensitivity of N-gram models with the structural properties of dependency grammar.

Journal ArticleDOI
TL;DR: It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
Abstract: Experimental comparisons of the running time of approximate string matching algorithms for the k differences problem are presented. Given a pattern string, a text string, and an integer k, the task is to find all approximate occurrences of the pattern in the text with at most k differences (insertions, deletions, changes). We consider seven algorithms based on different approaches including dynamic programming, Boyer-Moore string matching, suffix automata, and the distribution of characters. It turns out that none of the algorithms is the best for all values of the problem parameters, and the speed differences between the methods can be considerable.
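For reference, the k differences problem stated above can be solved with a plain dynamic-programming scan, sketched below in Python. This is only the textbook baseline (commonly attributed to Sellers), not any of the seven implementations the paper benchmarks, and the function name `k_differences` is an assumption. The one change from ordinary edit distance is that the top row of the table is all zeros, so an occurrence may begin at any text position.

```python
def k_differences(pattern, text, k):
    """Return the end positions (1-based) in text where some substring
    matches pattern with at most k insertions, deletions, or changes."""
    m = len(pattern)
    col = list(range(m + 1))  # column for the empty text prefix
    ends = []
    for j, tc in enumerate(text, start=1):
        new = [0]  # zero cost: an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            new.append(min(
                col[i - 1] + (pattern[i - 1] != tc),  # match or change
                col[i] + 1,                           # extra text character
                new[i - 1] + 1,                       # missing pattern character
            ))
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(k_differences("abc", "xxabcxx", 0))  # -> [5]
print(k_differences("abc", "xxabcxx", 1))  # -> [4, 5, 6]
```

This scan takes time proportional to the product of the pattern and text lengths regardless of k, which is the kind of cost the filtering and automaton-based approaches compared in the paper try to avoid.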

Patent
16 Sep 1996
TL;DR: In this paper, a method and apparatus for natural language parsing are described, which includes the steps of retrieving an input string, and performing a dictionary look-up for each word in the input string to form a correspondence between each word and a dictionary entry.
Abstract: A method and apparatus for natural language parsing are described. The invention includes the steps of retrieving an input string, and performing a dictionary look-up for each word in the input string to form a correspondence between each word and a dictionary entry. The dictionary entry provides lexical features of the word. The invention includes the additional step of processing the words in the input string beginning with a last word in the input string and continuing toward the first word in the input string. This step includes the step of associating a selected word in the input string with a word located to the left of the selected word in the input string to form a word phrase. The associating step is performed according to predetermined selection restriction rules. The steps of processing the words and associating a selected word are repeated until all words of the input string have been processed.

Jon McCormack
01 Apr 1996
TL;DR: This paper adapts string rewriting grammars based on L-Systems into a system for music composition with greater flexibility than previous composition models based on finite state automata or Petri nets.
Abstract: L-Systems have traditionally been used as a popular method for the modelling of space-filling curves, biological systems and morphogenesis. In this paper, we adapt string rewriting grammars based on L-Systems into a system for music composition. Representations of pitch, duration and timbre are encoded as grammar symbols, upon which a series of re-writing rules are applied. Parametric extensions to the grammar allow the specification of continuous data for the purposes of modulation and control. Such continuous data is also under control of the grammar. Using non-deterministic grammars with context sensitivity allows the simulation of Nth-order Markov models with a more economical representation than transition matrices and greater flexibility than previous composition models based on finite state automata or Petri nets. Using symbols in the grammar to represent relationships between notes (rather than absolute notes), in combination with a hierarchical grammar representation, permits the emergence of complex music compositions from relatively simple grammars.
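The core rewriting step of an L-System is small enough to sketch in Python. The example below is a deliberately reduced, deterministic and context-free version; the paper's grammars are non-deterministic, context-sensitive, and parametric, none of which is modelled here. The rule set, the use of pitch names as symbols, and the function name `rewrite` are hypothetical choices for illustration.

```python
def rewrite(axiom, rules, generations):
    """Rewrite every symbol in parallel, `generations` times.

    Symbols without a rule are copied unchanged.
    """
    s = axiom
    for _ in range(generations):
        s = "".join(rules.get(sym, sym) for sym in s)
    return s

# Hypothetical rules over pitch names: each symbol stands for one note.
rules = {"C": "CE", "E": "G", "G": "C"}
print(rewrite("C", rules, 3))  # -> CEGC
```

All symbols are replaced simultaneously in each generation, which is what distinguishes L-System rewriting from the sequential derivations of ordinary phrase-structure grammars.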

Patent
15 Oct 1996
TL;DR: In this paper, the bias current is adjusted such that the voltage drop across the whole of the second resistor string is equal to the voltage dropping across any one first resistor within the first resistor string.
Abstract: In a dual resistor string digital-to-analog converter, current biasing is used to isolate a first resistor string from a second resistor string. The first resistor string consists of multiple first resistors, and a first switch network responsive to the MSBs selectively couples the second resistor string in parallel to any one first resistor within the first resistor string. To prevent the second resistor string from drawing current from the first resistor string, a current source feeds a bias current into the second resistor string and a current drain draws the bias current from the second resistor string. The bias current is adjusted such that the voltage drop across the whole of the second resistor string is equal to the voltage drop across any one first resistor within the first resistor string. Use of a current source and current drain allows one to freely adjust the number of MSBs, LSBs and first and second resistor magnitudes to obtain optimum performance without concern for any adverse nonlinearity effects.

01 Jan 1996
TL;DR: Levenshtein distance is applied to pronunciations to overcome difficulties in dealing with partial matches of features and with nonoverlapping language patterns, and the result accords with traditional dialectology to a satisfying degree.
Abstract: Traditional dialectology relies on identifying language features which are common to one dialect area while distinguishing it from others. It has difficulty in dealing with partial matches of features and with nonoverlapping language patterns. This paper applies Levenshtein distance, a measure of string distance, to pronunciations to overcome both of these difficulties. Partial matches may be quantified, and nonoverlapping patterns may be included in weighted averages of phonetic distance. The result accords with traditional dialectology to a satisfying degree.

Book ChapterDOI
30 Jul 1996
TL;DR: This paper defines a notion of alignment between two RNA strings and presents a method for optimally aligning a given RNA sequence with unknown secondary structure to one with known sequence and structure, thus attacking the structure prediction problem in the case when the structure of a closely related sequence is known.
Abstract: Ribonucleic acid (RNA) strings are strings over the four-letter alphabet {A,C,G,U} with a secondary structure of base-pairing between A-U and C-G pairs in the string. Edges are drawn between two bases that are paired in the secondary structure and these edges have traditionally been assumed to be noncrossing. The noncrossing base-pairing naturally leads to a tree-like representation of the secondary structure of RNA strings. In this paper, we address several notions of similarity between two RNA strings that take into account both the primary sequence and secondary base-pairing structure of the strings. We present efficient algorithms for exact matching and approximate matching between two RNA strings. We define a notion of alignment between two RNA strings and devise algorithms based on dynamic programming. We then present a method for optimally aligning a given RNA sequence with unknown secondary structure to one with known sequence and structure, thus attacking the structure prediction problem in the case when the structure of a closely related sequence is known. The techniques employed to prove our results include reductions to well-known string matching problems, allowing wild cards and ranges, and speeding up dynamic programming by using the tree structures implicit in the secondary structure of RNA strings.

Patent
18 Dec 1996
TL;DR: A conditional transition network for representing a domain of knowledge in a computer based system and computational procedures for use with the same is presented in this article, where each node of the network comprises a number of data fields, namely: a Precondition Field which contains an expression that evaluates to either TRUE, POSSIBLE or FALSE; a Question Field consisting of a linguistic string; an Answer Field whose expression can evaluate to any numerical or string domain, a Contents Field from a textual, visual, audio, or multimedia domain; and several other fields including a Delay Field.
Abstract: A conditional transition network for representing a domain of knowledge in a computer based system and computational procedures for use with the same is here presented. Each node of the network comprises a number of data fields, namely: a Precondition Field which contains an expression that evaluates to either TRUE, POSSIBLE or FALSE; a Question Field consisting of a linguistic string; an Answer Field whose expression can evaluate to any numerical or string domain, a Contents Field from a textual, visual, audio, or multimedia domain; and several other fields including a Delay Field. The edges of the network are induced by the precondition and answer formula and by edges embedded in the contents. If the precondition of node A refers to node B, then there is a precondition edge from B to A. Similarly, if the answer formula of node A refers to node B, then there is an answer edge from node B to node A. Edges from the Contents Field of one node to the Contents Field of another node are called hypermedia edges or links. They also may have predicates attached to them. When the querier of the network answers questions, various nodes change their precondition values among the values TRUE, POSSIBLE and FALSE. TRUE nodes correspond to those the querier should examine further. FALSE ones correspond to those of no further interest. POSSIBLE ones correspond to those that may or may not be of further interest.

Patent
16 Feb 1996
TL;DR: In this paper, the problem of detecting a precise input coordinate even if the S/N of a coordinate detection signal is low is solved by using the digital data string of a noise component.
Abstract: PROBLEM TO BE SOLVED: To detect a precise input coordinate even if the S/N of a coordinate detection signal is low. SOLUTION: First and second switch means 34 and 35 scan first and second electrodes 31 and 32 without an indication means 40 placed on a coordinate input panel 33. X and Y signal amplification means 37 and 38 amplify the X and Y signals and remove AC components. A coordinate operation means 42 sample-holds the coordinate detection voltage and A/D converts it. The obtained digital value is stored in a storage means 41 as the digital data string of a noise component. At the time of inputting the coordinate, the digital data string with the indication means 40 present is obtained in the same way. The subtracter of the coordinate operation means 42 subtracts the digital data string of the noise component stored in the storage means 41. The digital data string of the coordinate detection voltage, free of the noise component, is thus obtained, and the precise input coordinate is determined from this digital data string.

Journal ArticleDOI
TL;DR: In this article, a mass-spring-damper system is used as a simplified framework for string stability analysis and for studying the properties of several longitudinal control schemes which have been proposed in the literature.