Showing papers on "String (computer science)" published in 2005

Open Access

Proceedings Article•DOI•

Clause Restructuring for Statistical Machine Translation

Michael Collins¹, Philipp Koehn², Ivona Kučerová¹•Institutions (2)

Massachusetts Institute of Technology¹, University of Edinburgh²

25 Jun 2005

TL;DR: A syntactic reordering approach is applied as a pre-processing step in both the training and decoding phases of a phrase-based statistical MT system, improving German-to-English translation from a 25.2% Bleu score for the baseline system to 26.8% for the system with reordering.

Abstract: We describe a method for incorporating syntactic information in statistical machine translation systems. The first step of the method is to parse the source language string that is being translated. The second step is to apply a series of transformations to the parse tree, effectively reordering the surface string on the source language side of the translation system. The goal of this step is to recover an underlying word order that is closer to the target language word-order than the original string. The reordering approach is applied as a pre-processing step in both the training and decoding phases of a phrase-based statistical MT system. We describe experiments on translation from German to English, showing an improvement from 25.2% Bleu score for a baseline system to 26.8% Bleu score for the system with reordering, a statistically significant improvement.

679 citations

Patent•

Variable programming of non-volatile memory

Jian Chen¹, Chi-Ming Wang¹•Institutions (1)

SanDisk¹

23 Mar 2005

TL;DR: To reduce program disturb in non-volatile memory, select cells such as those on the last word line of a NAND string are programmed using a lower threshold voltage verify level or a lower program voltage than other word lines; in some implementations, additional read levels are established for reading the states programmed with the lower verify levels.

Abstract: Systems and methods in accordance with various embodiments can provide for reduced program disturb in non-volatile semiconductor memory. In one embodiment, select memory cells such as those connected to a last word line of a NAND string are programmed using one or more program verify levels or voltages that are different than a corresponding level used to program other cells or word lines. One exemplary embodiment includes using a lower threshold voltage verify level for select physical states when programming the last word line to be programmed for a string during a program operation. Another embodiment includes applying a lower program voltage to program memory cells of the last word line to select physical states. Additional read levels are established for reading the states programmed using lower verify levels in some exemplary implementations. A second program voltage step size that is larger than a nominal step size is used in one embodiment when programming select memory cells or word lines, such as the last word line to be programmed for a NAND string.

402 citations

Journal Article•DOI•

Semi-supervised protein classification using cluster kernels

Jason Weston¹, Christina S. Leslie², Eugene Ie², Dengyong Zhou³, André Elisseeff³, William Stafford Noble⁴•Institutions (4)

Princeton University¹, Columbia University², Max Planck Society³, University of Washington⁴

01 Aug 2005-Bioinformatics

TL;DR: The authors develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences, and show that these methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data.

Abstract: Motivation: Building an accurate protein classification system depends critically upon choosing a good representation of the input sequences of amino acids. Recent work using string kernels for protein data has achieved state-of-the-art classification performance. However, such representations are based only on labeled data (examples with known 3D structures, organized into structural classes), whereas in practice, unlabeled data are far more plentiful. Results: In this work, we develop simple and scalable cluster kernel techniques for incorporating unlabeled data into the representation of protein sequences. We show that our methods greatly improve the classification performance of string kernels and outperform standard approaches for using unlabeled data, such as adding close homologs of the positive examples to the training data. We achieve equal or superior performance to previously presented cluster kernel methods while achieving far greater computational efficiency. Availability: Source code is available at www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot. The Spider matlab package is available at www.kyb.tuebingen.mpg.de/bs/people/spider Contact: jasonw@nec-labs.com Supplementary information: www.kyb.tuebingen.mpg.de/bs/people/weston/semiprot

242 citations

Journal Article•DOI•

Space efficient linear time construction of suffix arrays

Pang Ko¹, Srinivas Aluru¹•Institutions (1)

Iowa State University¹

01 Jun 2005-Journal of Discrete Algorithms

TL;DR: This result is one of the first linear time suffix array construction algorithms, which improve upon the previously known O(n log n) time direct algorithms for suffix sorting and can be used to derive a different linear time construction algorithm for suffix trees.
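
For readers unfamiliar with the data structure, the following minimal Python sketch shows what a suffix array is. It is a naive construction that simply sorts the suffixes, not the linear time algorithm of the paper, and the function name `naive_suffix_array` is an illustrative assumption.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, sorted lexicographically.

    Naive illustration only: sorting slices costs far more than the
    linear time construction that the paper describes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


# Suffixes of "banana" in sorted order: a, ana, anana, banana, na, nana
print(naive_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
```

The array stores only integer positions, which is why suffix arrays are usually more space efficient than suffix trees over the same text.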

234 citations

Patent•

Method and system for approximate string matching

Alexei Nevidomski¹, Pavel Volkov¹•Institutions (1)

IBM¹

16 Jun 2005

TL;DR: A target string is approximately matched against a trie whose nodes each represent at least one character, forming a lexicon of words and word fragments; the traversal loops back to the root at flagged nodes to handle compound words, and correction rules are applied to the target string where available.

Abstract: A method and system are provided for approximate string matching of a target string to a trie data structure. The trie data structure has a root node and generations of child nodes each node representing at least one character in an alphabet to provide a lexicon of words and word fragments. The method involves traversing the trie data structure starting from the root node by comparing each node of a branch of the trie data structure to characters in the target string and adding characters traversed in a branch of the trie data structure to a gathered string to provide suggestions of approximate matches. If the method reaches a node flagged as a node for a word or a word fragment and, if the target string is longer than the gathered string, the method loops back to the root node, and continues the traverse from the root node. This enables the trie data structure to use word fragments for compound words and to split non-delimited words where appropriate. The method also includes, at each node, determining if there is a correction rule for one or more characters in the remainder of the target string from the current node, and if so, applying the correction rule to the target string to obtain a modified target string.
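
The loop-back step described in this abstract can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the patented method: it performs exact matching only and omits the correction rules, and the names `build_trie`, `split_compound`, and the `_END` marker are hypothetical.

```python
_END = ""  # marker key; never collides with a single-character edge label


def build_trie(words):
    """Build a nested-dict trie; nodes ending a word or fragment carry _END."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True
    return root


def split_compound(trie, target):
    """Return lexicon entries that concatenate to `target`, or None if impossible."""
    def walk(start):
        if start == len(target):
            return []
        node = trie  # every segment starts again from the root
        for end in range(start, len(target)):
            node = node.get(target[end])
            if node is None:
                return None
            if _END in node:
                # Flagged node reached with input remaining: loop back to the root.
                rest = walk(end + 1)
                if rest is not None:
                    return [target[start:end + 1]] + rest
        return None

    return walk(0)


lexicon = build_trie(["boo", "book", "case", "shelf"])
print(split_compound(lexicon, "bookcase"))  # ['book', 'case']
```

The sketch backtracks when a shorter fragment such as "boo" leads to a dead end; without memoization this can be slow on adversarial inputs, which is acceptable for an illustration.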

204 citations

Patent•

Technique and apparatus for completing multiple zones

Gary L. Rytlewski¹, Ashish Sharma¹, Liana M Mitrea¹•Institutions (1)

Schlumberger¹

13 Dec 2005

TL;DR: An apparatus usable with a well includes a string with a passageway and a plurality of tools mounted in the string; the tools are adapted to be placed in a state to catch objects of substantially the same size (free-falling and/or pumped-down objects, as just a few examples) that are communicated downhole through the passageway.

Abstract: An apparatus that is usable with a well includes a string and a plurality of tools that are mounted in the string. The string includes a passageway. The tools are mounted in the string and are adapted to be placed in a state to catch objects (free-falling objects and/or pumped-down objects, as just a few examples) of substantially the same size, which are communicated downhole through the passageway.

193 citations

Proceedings Article•DOI•

On deriving unknown vulnerabilities from zero-day polymorphic and metamorphic worm exploits

Jedidiah R. Crandall, Zhendong Su, S. Felix Wu, Frederic T. Chong

07 Nov 2005

TL;DR: The limits that exploit predicates place on worm polymorphism and metamorphism are formalized and analyzed quantitatively with a symbolic execution tool called DACODA; the authors conclude that single contiguous byte string signatures are not effective for content filtering, and that token-based byte string signatures composed of smaller substrings are only semantically rich enough to be effective for content filtering if the vulnerability lies in a part of a protocol that is not commonly used.

Abstract: Vulnerabilities that allow worms to hijack the control flow of each host that they spread to are typically discovered months before the worm outbreak, but are also typically discovered by third party researchers. A determined attacker could discover vulnerabilities as easily and create zero-day worms for vulnerabilities unknown to network defenses. It is important for an analysis tool to be able to generalize from a new exploit observed and derive protection for the vulnerability. Many researchers have observed that certain predicates of the exploit vector must be present for the exploit to work and that therefore these predicates place a limit on the amount of polymorphism and metamorphism available to the attacker. We formalize this idea and subject it to quantitative analysis with a symbolic execution tool called DACODA. Using DACODA we provide an empirical analysis of 14 exploits (seven of them actual worms or attacks from the Internet, caught by Minos with no prior knowledge of the vulnerabilities and no false positives observed over a period of six months) for four operating systems. Evaluation of our results in the light of these two models leads us to conclude that 1) single contiguous byte string signatures are not effective for content filtering, and token-based byte string signatures composed of smaller substrings are only semantically rich enough to be effective for content filtering if the vulnerability lies in a part of a protocol that is not commonly used, and that 2) practical exploit analysis must account for multiple processes, multithreading, and kernel processing of network data necessitating a focus on primitives instead of vulnerabilities.

180 citations

Journal Article•DOI•

Medusa: a simple tool for interaction graph analysis

Sean D. Hooper, Peer Bork

15 Dec 2005-Bioinformatics

TL;DR: Medusa is a Java application for visualizing and manipulating graphs of interaction, such as data from the STRING database, that is optimized for accessing protein interaction data from STRING but can be used for any type of graph from any scientific field.

Abstract: Summary: Medusa is a Java application for visualizing and manipulating graphs of interaction, such as data from the STRING database. It features an intuitive user interface developed with the help of biologists. Medusa is optimized for accessing protein interaction data from STRING, but can be used for any type of graph from any scientific field. Availability: Medusa, along with sample datasets and instructions, can be downloaded from http://www.bork.embl.de/medusa Contact: [email protected]

174 citations

Patent•

Methods, systems, and products for translating text to speech

Steven Tischer¹, Robert Koch², Dale W. Malik•Institutions (2)

BellSouth¹, Nuance Communications²

05 Nov 2005

TL;DR: A textual sequence identified in content to be translated to speech is correlated to a phrase; a voice file maps each stored phrase to a corresponding sequential string of phonemes, which is retrieved and processed when translating the textual sequence to speech.

Abstract: Methods, systems, and products are disclosed for translating text to speech. One such method receives content for translation to speech, identifies a textual sequence in the content, and correlates the textual sequence to a phrase. A voice file storing multiple phrases is accessed, with the voice file mapping each phrase to a corresponding sequential string of phonemes. The sequential string of phonemes, corresponding to the phrase, is retrieved and processed when translating the textual sequence to speech.

155 citations

Patent•

Systems and methods for translating Chinese pinyin to Chinese characters

Jun Wu¹, Huican Zhu¹, Hongjun Zhu¹•Institutions (1)

Google¹

16 Mar 2005

TL;DR: A pinyin input is translated by generating a set of Chinese character strings from it, using a Chinese dictionary that includes words derived from Chinese inputs and a language model trained on those inputs; each character string has a weight indicating the likelihood that it corresponds to the pinyin input.

Abstract: Systems and methods to process and translate pinyin to Chinese characters and words are disclosed. A Chinese language model is trained by extracting unknown character strings from Chinese inputs, e.g., documents and/or user inputs/queries, determining valid words from the unknown character strings, and generating a transition matrix based on the Chinese inputs for predicting a word string given the context. A method for translating a pinyin input generally includes generating a set of Chinese character strings from the pinyin input using a Chinese dictionary including words derived from the Chinese inputs and a language model trained based on the Chinese inputs, each character string having a weight indicating the likelihood that the character string corresponds to the pinyin input. Ambiguous user input may be classified as non-pinyin or pinyin by identifying an ambiguous pinyin/non-pinyin ASCII word in the user input and analyzing the context to classify the user input.

142 citations

Patent•

Battery management system and method

Charles E. Burns

15 Sep 2005

TL;DR: A battery management system controls individual cells in a battery string; it includes a charger, a voltmeter, a selection circuit and a microprocessor, with the selection circuit connecting each cell to the charger and voltmeter under control of the microprocessor.

Abstract: A battery management system is disclosed for control of individual cells in a battery string. The battery management system includes a charger, a voltmeter, a selection circuit and a microprocessor. Under control of the microprocessor, the selection circuit connects each cell of the battery string to the charger and voltmeter. Information relating to battery performance is recorded and analyzed. The analysis depends upon the conditions under which the battery is operating. By monitoring the battery performance under different conditions, problems with individual cells can be determined and corrected.

Patent•

Method and system for mapping strings for comparison

John I. McConnell¹, Julie D. Bennett¹, Yung-Shin Lin¹•Institutions (1)

Microsoft¹

29 Mar 2005

TL;DR: A method and system map characters in a string that combines characters representing indexed expressions with characters representing non-indexed expressions, producing a weight array that can be used to compare two such strings.

Abstract: A method and system for mapping a number of characters in a string, wherein the string comprises a combination of characters representing indexed expressions and a combination of characters representing non-indexed expressions. One embodiment produces a weight array that can be utilized to compare a first and second string having indexed and non-indexed expressions. In one embodiment, a method generates a set of special weights for characters that represent indexed and non-indexed expressions. The method then associates a weight value of an indexed expression with the specific group of characters representing a specific non-indexed expression, and generates a weight array by retrieving a plurality of special weights associated with the specific group of characters representing the specific non-indexed expression and the associated weight value of the indexed expression.

Patent•

Methods and systems for selecting a language for text segmentation

Gilad Israel Elbaz¹, Jacob Leon Mandelson¹•Institutions (1)

Google¹

28 Sep 2005

TL;DR: Candidate languages associated with a string of characters are identified, a segmented result is determined for each, and an operable language is selected based at least in part on the frequency of occurrence of each segmented result.

Abstract: Methods and systems for selecting a language for text segmentation are disclosed. In one embodiment, at least a first candidate language and a second candidate language associated with a string of characters are identified, at least a first segmented result associated with the first candidate language and a second segmented result associated with the second candidate language are determined, a first frequency of occurrence for the first segmented result and a second frequency of occurrence for the second segmented result are determined, and an operable language is identified from the first candidate language and the second candidate language based at least in part on the first frequency of occurrence and the second frequency of occurrence.

Journal Article•DOI•

Boosting textual compression in optimal linear time

Paolo Ferragina¹, Raffaele Giancarlo², Giovanni Manzini³, Marinella Sciortino²•Institutions (3)

University of Pisa¹, University of Palermo², University of Eastern Piedmont³

01 Jul 2005-Journal of the ACM

TL;DR: A general boosting technique for Textual Data Compression that can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts, and is very simple and optimal in terms of time.

Abstract: We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression performance guarantee. It displays the following remarkable properties: (a) it can turn any memoryless compressor into a compression algorithm that uses the “best possible” contexts; (b) it is very simple and optimal in terms of time; and (c) it admits a decompression algorithm again optimal in time. To the best of our knowledge, this is the first boosting technique displaying these properties. Technically, our boosting technique builds upon three main ingredients: the Burrows-Wheeler Transform, the Suffix Tree data structure, and a greedy algorithm to process them. Specifically, we show that there exists a proper partition of the Burrows-Wheeler Transform of a string s that shows a deep combinatorial relation with the kth order entropy of s. That partition can be identified via a greedy processing of the suffix tree of s with the aim of minimizing a proper objective function over its nodes. The final compressed string is then obtained by compressing individually each substring of the partition by means of the base compressor we wish to boost. Our boosting technique is inherently combinatorial because it does not need to assume any prior probabilistic model about the source emitting s, and it does not deploy any training, parameter estimation and learning. Various corollaries are derived from this main achievement. Among the others, we show analytically that using our booster, we get better compression algorithms than some of the best existing ones, that is, LZ77, LZ78, PPMC and the ones derived from the Burrows-Wheeler Transform. Further, we settle analytically some long-standing open problems about the algorithmic structure and the performance of BWT-based compressors. Namely, we provide the first family of BWT algorithms that do not use Move-To-Front or Symbol Ranking as a part of the compression process.
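
As background for the first ingredient named in this abstract, here is a minimal Python sketch of the Burrows-Wheeler Transform and its inverse. It uses the textbook sorted-rotations construction, which is quadratic or worse, and does not implement the boosting technique, the partitioning, or the suffix tree processing of the paper. The function names and the `$` end marker are assumptions made for illustration.

```python
def bwt(s: str, end: str = "$") -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive illustration)."""
    if end in s:
        raise ValueError("end marker must not occur in the input")
    s += end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, end: str = "$") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

The transform groups characters that share the same following context, which is the property that the partition described in the abstract exploits.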

Proceedings Article•DOI•

Learning a Spelling Error Model from Search Query Logs

Farooq Ahmad¹, Grzegorz Kondrak¹•Institutions (1)

University of Alberta¹

06 Oct 2005

TL;DR: This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.

Abstract: Applying the noisy channel model to search query spelling correction requires an error model and a language model. Typically, the error model relies on a weighted string edit distance measure. The weights can be learned from pairs of misspelled words and their corrections. This paper investigates using the Expectation Maximization algorithm to learn edit distance weights directly from search query logs, without relying on a corpus of paired words.
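
The weighted string edit distance that such an error model relies on can be sketched in Python as below. The sketch only shows how substitution weights would be used once they are available; it does not implement the Expectation Maximization learning step from the paper. The function name, the default costs, and the example weight for the pair ("e", "i") are hypothetical.

```python
def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Dynamic-programming edit distance with per-pair substitution weights.

    `sub_cost` maps (source_char, target_char) to a cost; pairs that are
    not listed fall back to a cost of 1.0 (an assumption of this sketch).
    """
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,          # delete a
                dist[i][j - 1] + ins_cost,          # insert b
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[-1][-1]


# With uniform weights this is ordinary Levenshtein distance.
print(weighted_edit_distance("kitten", "sitting", {}))                 # 3.0
# A cheaper e -> i substitution lowers the total cost.
print(weighted_edit_distance("kitten", "sitting", {("e", "i"): 0.5}))  # 2.5
```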

Report•DOI•

A conditional random field for discriminatively-trained finite-state string edit distance

Andrew McCallum¹, Kedar Bellare¹, Fernando Pereira²•Institutions (2)

University of Massachusetts Amherst¹, University of Pennsylvania²

26 Jul 2005

TL;DR: This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings, trained on both positive and negative instances of string pairs.

Abstract: The need to measure sequence similarity arises in information extraction, object identity, data mining, biological sequence analysis, and other domains. This paper presents discriminative string-edit CRFs, a finite-state conditional random field model for edit sequences between strings. Conditional random fields have advantages over generative approaches to this problem, such as pair HMMs or the work of Ristad and Yianilos, because as conditionally-trained methods, they enable the use of complex, arbitrary actions and features of the input strings. As in generative models, the training data does not have to specify the edit sequences between the given string pairs. Unlike generative models, however, our model is trained on both positive and negative instances of string pairs. We present positive experimental results on several data sets.

Patent•

Storing digital secrets in a vault

Michael Szydlo¹•Institutions (1)

RSA¹

02 Nov 2005

TL;DR: A user's answers to a number of different questions are obtained; for each permitted subset of questions, a corresponding string of answers is generated and hashed, and the resulting hash value is combined with the digital secret, which is then stored in the vault.

Abstract: Methods and systems for storing secret information in a digital vault include obtaining from a user answers to a number of different questions, and identifying which subsets or combinations of the questions for which correct answers later provided by an entity will enable that entity to gain access to the secret information in the vault. The number of questions in each combination is less than the total number of questions, and at least one subset has at least two questions. For each subset, a corresponding string of answers is generated, the string is hashed, and the resulting hash value is combined with the digital secret. This hides the digital secret, which is then stored in the vault. Methods and systems for registering authentication material include storing a hashed string of answers for each combination, generating “multiple authenticators.”
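
The per-subset hashing step can be illustrated with the Python sketch below. Several choices here are assumptions that the abstract does not specify: SHA-256 as the hash, XOR as the way the hash value is combined with the secret, fixed-size subsets chosen by a `threshold` parameter, and lower-casing of answers. The function names `seal` and `open_vault` are hypothetical, and the sketch is not a secure design.

```python
import hashlib
from itertools import combinations


def _mask(answers, length):
    """Hash a string of answers and keep `length` bytes of the digest."""
    joined = "\x1f".join(a.strip().lower() for a in answers).encode("utf-8")
    return hashlib.sha256(joined).digest()[:length]


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def seal(secret: bytes, answers, threshold: int) -> dict:
    """Store one hidden copy of `secret` per subset of `threshold` questions."""
    if len(secret) > hashlib.sha256().digest_size:
        raise ValueError("sketch supports secrets of at most 32 bytes")
    if not 0 < threshold < len(answers):
        raise ValueError("each subset must be smaller than the full question set")
    vault = {}
    for subset in combinations(range(len(answers)), threshold):
        mask = _mask([answers[i] for i in subset], len(secret))
        vault[subset] = _xor(secret, mask)
    return vault


def open_vault(vault: dict, subset: tuple, answers_for_subset) -> bytes:
    """Recover the secret from the entry for `subset`, given its answers."""
    hidden = vault[subset]
    return _xor(hidden, _mask(answers_for_subset, len(hidden)))


vault = seal(b"top-secret", ["rex", "paris", "blue"], threshold=2)
print(open_vault(vault, (0, 2), ["Rex", "blue"]))  # b'top-secret'
```

Answering any two of the three questions correctly recovers the secret in this sketch, while wrong answers yield unrelated bytes rather than an error.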

Journal Article•DOI•

A parallel decision tree-based method for user authentication based on keystroke patterns

Yong Sheng¹, Vir V. Phoha², S.M. Rovnyak³•Institutions (3)

Indiana University – Purdue University Indianapolis¹, Louisiana Tech University², Indiana University³

01 Aug 2005

TL;DR: A Monte Carlo approach to attain sufficient training data, a splitting method to improve effectiveness, and a system composed of parallel decision trees (DTs) to authenticate users based on keystroke patterns are proposed.

Abstract: We propose a Monte Carlo approach to attain sufficient training data, a splitting method to improve effectiveness, and a system composed of parallel decision trees (DTs) to authenticate users based on keystroke patterns. For each user, approximately 19 times as much simulated data was generated to complement the 387 vectors of raw data. The training set, including raw and simulated data, is split into four subsets. For each subset, wavelet transforms are performed to obtain a total of eight training subsets for each user. Eight DTs are thus trained using the eight subsets. A parallel DT is constructed for each user, which contains all eight DTs with a criterion for its output that it authenticates the user if at least three DTs do so; otherwise it rejects the user. Training and testing data were collected from 43 users who typed the exact same string of length 37 nine consecutive times to provide data for training purposes. The users typed the same string at various times over a period from November through December 2002 to provide test data. The average false reject rate was 9.62% and the average false accept rate was 0.88%.
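
The output criterion of the parallel DT (accept if at least three of the eight DTs accept) can be sketched in Python as follows. The trees here are stand-in callables with made-up thresholds on a single timing feature; they are assumptions for illustration and are not the trained decision trees or wavelet features from the paper.

```python
from typing import Callable, Sequence

Tree = Callable[[Sequence[float]], bool]


def authenticate(sample: Sequence[float], trees: Sequence[Tree], min_accepts: int = 3) -> bool:
    """Accept the user if at least `min_accepts` trees accept the sample."""
    accepts = sum(1 for tree in trees if tree(sample))
    return accepts >= min_accepts


# Eight hypothetical stand-in trees, each a simple threshold test.
thresholds = [0.10, 0.12, 0.14, 0.16, 0.18, 0.20, 0.22, 0.24]
trees = [lambda s, t=t: s[0] < t for t in thresholds]

print(authenticate([0.19], trees))  # True: three trees accept
print(authenticate([0.21], trees))  # False: only two trees accept
```

Requiring three of eight votes rather than a majority trades a lower false reject rate against a higher false accept rate.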

Patent•

Controller circuitry for light emitting diodes

Da Liu, Lin Yung-Lin

11 Oct 2005

TL;DR: Power is supplied to an LED array having at least a first and a second string of LEDs coupled in parallel, each with at least two LEDs; feedback signals proportional to the current in each string are compared, and the voltage drop of at least the first string is controlled to adjust its current relative to the second string.

Abstract: A method according to one embodiment may include supplying power to an LED array having at least a first string of LEDs and a second string of LEDs coupled in parallel, each of the strings including at least two LEDs. The method of this embodiment may also include comparing a first feedback signal from the first string of LEDs and a second feedback signal from the second string of LEDs. The first feedback signal is proportional to current in said first string of LEDs and said second feedback signal is proportional to current in said second string of LEDs. The method of this embodiment may also include controlling a voltage drop of at least the first string of LEDs to adjust the current of the first string of LEDs relative to the second string of LEDs, based on, at least in part, the comparing of the first and second feedback signals. Of course, many alternatives, variations, and modifications are possible without departing from this embodiment.

Patent•

Word recognition using ontologies

Kas Kasravi, Maria Risov, Sundar Varadarajan

21 Nov 2005

TL;DR: An ontology is used in word recognition to resolve ambiguities in an input string of characters, such as output from an OCR or voice recognition device, where the values of some characters in the language elements are uncertain.

Abstract: Systems, and associated apparatus, methods, or computer program products, may use ontologies to provide improved word recognition. The ontologies may be applied in word recognition processes to resolve ambiguities in language elements (e.g., words) where the values of some of the characters in the language elements are uncertain. Implementations of the method may use an ontology to resolve ambiguities in an input string of characters, for example. In some implementations, the input string may be received from a language conversion source such as, for example, an optical character recognition (OCR) device that generates a string of characters in electronic form from visible character images, or a voice recognition (VR) device that generates a string of characters in electronic form from speech input. Some implementations may process the generated character strings by using an ontology in combination with syntactic and/or grammatical analysis engines to further improve word recognition accuracy.

Patent•

Question answering over structured content on the web

Yevgeny E. Agichtein¹, Christopher J. C. Burges¹, Eric D. Brill¹•Institutions (1)

Microsoft¹

21 Oct 2005

TL;DR: Structured content and associated metadata from the Web are leveraged to provide specific answer string responses to user questions; the structured content can be indexed at crawl-time to facilitate searching at search-time.

Abstract: Structured content and associated metadata from the Web are leveraged to provide specific answer string responses to user questions. The structured content can also be indexed at crawl-time to facilitate searching of the content at search-time. Ranking techniques can also be employed to facilitate in providing an optimum answer string and/or a top K list of answer strings for a query. Ranking can be based on trainable algorithms that utilize feature vectors for candidate answer strings. In one instance, at crawl-time, structured content is indexed and automatically associated with metadata relating to the structured content and the source web page. At search-time, candidate indexed structured content is then utilized to extract an appropriate answer string in response to a user query.

Book Chapter•DOI•

Format String Attacks

James C. Foster, Vitaly Osipov, Nish Bhalla, Niels Heinen, Dave Aitel

01 Jan 2005

TL;DR: To prevent format string bugs, avoid using user-controlled variables as the format string argument in all relevant functions or, even better, use a constant format string wherever possible.

Abstract: Format string vulnerabilities occur when programmers pass externally supplied data to a printf function (or similar) as, or as part of, the format string argument. Printf functions, and bugs due to the misuse of them, have been around for years. However, no one ever conceived of exploiting them to force the execution of shellcode until the year 2000. In addition to format string bugs, new techniques have emerged such as overwriting malloc structures, relying on free() to overwrite pointers, and using signed integer index errors. Format bugs appear because of the interplay of C functions with variable numbers of arguments and the power of format specification tokens, which sometimes allow writing values on the stack. Techniques for exploiting format string bugs require many calculations, which are usually automated with scripts. When a format string in printf (or any similar function) is controlled by an attacker, under certain conditions the attacker can modify memory and read arbitrary data simply by supplying a specially crafted format string. To prevent format string bugs, employing user-controlled variables as the format string argument in all relevant functions should be avoided or, even better, a constant format string should be used wherever possible. Searching for format string bugs is easy compared to the cases of stack or heap overflows, both in source code and in existing binaries.
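
The advice to keep the format string constant can be illustrated with a Python analog. This is not the C printf bug that the chapter covers: Python's `str.format` cannot write to memory, but a user-controlled template can still read attributes of the objects passed to it. The `Account` class and both function names are hypothetical.

```python
class Account:
    def __init__(self, name: str, password: str):
        self.name = name
        self.password = password


def render_unsafe(user_template: str, account: Account) -> str:
    # BAD: externally supplied text is used as the format string itself.
    return user_template.format(account)


def render_safe(user_text: str, account: Account) -> str:
    # GOOD: the format string is a constant; external text is only an argument.
    return "{} says: {}".format(account.name, user_text)


account = Account("alice", "hunter2")
attack = "{0.password}"

print(render_unsafe(attack, account))  # hunter2  (secret leaked)
print(render_safe(attack, account))    # alice says: {0.password}
```

In the safe version the attacker's text is treated as data, so the replacement field inside it is never interpreted.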

Patent•

Switching circuit implementing variable string matching

Michael J. Miller¹, Vladan Djakovic¹•Institutions (1)

Integrated Device Technology¹

11 Oct 2005

TL;DR: A content matching engine (CME) uses a content addressable memory (CAM) array that stores a plurality of strings in separate entries; the strings of each rule can be linked by per-entry counters associated with each string, or by a state machine.

Abstract: A content matching engine (CME) uses a content addressable memory (CAM) array that stores a plurality of strings in separate entries. The strings define one or more rules to be matched. The strings of each rule are linked, thereby providing a required order. The strings of each rule can be linked by per-entry counters associated with each string, or by a state machine. The strings in the CAM array are compared with a packet, which is shifted one symbol at a time (because the strings can start on any boundary). When the CAM detects a match, the CAM skips over the string that resulted in the match, thereby preventing erroneous matches. The CAM allows parallel matching to be performed for multiple rules. If the contents of a packet match all of the strings of a rule, in order, then the CME asserts a match/index signal that identifies the matched rule.
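
The matching semantics described here (all strings of a rule found in order, skipping over each matched string) can be modeled in software with the Python sketch below. It is a behavioral illustration only: the patent describes parallel hardware matching in a CAM, whereas this code scans sequentially, and the names `rule_matches`, `match_rules`, and the example rules are assumptions.

```python
def rule_matches(packet: bytes, rule: list[bytes]) -> bool:
    """True if every string of `rule` occurs in `packet`, in order, without overlap."""
    position = 0
    for pattern in rule:
        found = packet.find(pattern, position)  # strings may start on any boundary
        if found == -1:
            return False
        # Skip over the matched string so its bytes cannot be matched again.
        position = found + len(pattern)
    return True


def match_rules(packet: bytes, rules: dict[str, list[bytes]]) -> list[str]:
    """Return the names of all rules matched by the packet."""
    return [name for name, rule in rules.items() if rule_matches(packet, rule)]


rules = {
    "admin-get": [b"GET", b"/admin"],
    "reversed": [b"/admin", b"GET"],
}
print(match_rules(b"GET /admin HTTP/1.1", rules))  # ['admin-get']
```

Taking the leftmost occurrence of each string leaves the most room for the strings that follow, so the greedy scan finds an ordered match whenever one exists.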

Patent•

Real-time data localization

Jonah Petri¹, Andrew Wilson¹, Christopher E. Hansten¹, James F. Kateley¹•Institutions (1)

Apple Inc.¹

25 Aug 2005

TL;DR: In this article, a method, apparatus, and system are provided for performing real-time or near-real-time localization of data, which comprises monitoring an input string and comparing a semantic associated with the input string to a semantic associated with at least one stored string.

Abstract: A method, apparatus, and system are provided for performing real-time or near-real-time localization of data. The method comprises monitoring an input string and comparing a semantic associated with the input string to a semantic associated with at least one stored string. The method further comprises providing the stored string as an alternative to the input string.

Patent•

Method and apparatus for processing text and character data

Michael C. Battilana

18 May 2005

TL;DR: In this article, a text processing system receives a character input string and determines whether to apply character processing, such as accent or punctuation, in a non-English language such as Italian.

Abstract: An apparatus and method for processing text or character data are disclosed. A text processing system receives a character input string and determines whether to apply character processing. A non-English language such as Italian can be entered into a processing system such as a computer using a standard English based keyboard such that additional keys for providing accents or other grammatical and punctuation symbols or characters not existing in English are not required. In one mode, text is automatically accented or punctuated without requiring user intervention. In another mode, a user is provided with a list of accent or punctuation choices so that the user may select the optimum accent or punctuation. Text processing of an input may be activated by a text sequence including a possible vowel accent or apostrophe error, and may continue as an input method editor loop in response to repeated actuations of the key associated with the first activation event. When an activator event input is detected, a rules based system is utilized to select a correctly accented and punctuated character. A list of alternative accents and punctuations is optionally displayed, and a user may toggle through the list using the activator event to select a desired character. The display provides information for a level of certainty of a selected character or word.

Journal Article•DOI•

A fully linear-time approximation algorithm for grammar-based compression

Hiroshi Sakamoto¹•Institutions (1)

Kyushu Institute of Technology¹

01 Jun 2005-Journal of Discrete Algorithms

TL;DR: An optimization problem to minimize the size of a context-free grammar deriving a given string by guaranteeing O(log(n/g∗)) approximation ratio without suffix tree construction is presented.
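
To make concrete what "a context-free grammar deriving a given string" means, here is a small sketch that builds such a grammar by greedily replacing the most frequent adjacent pair of symbols. This is a generic pair-replacement heuristic chosen for illustration; it is not the algorithm of the paper, is not linear-time, and carries no approximation guarantee. The names `build_grammar`, `expand`, and `grammar_size` are assumptions of this example.

```python
from collections import Counter


def build_grammar(text):
    """Greedy pair replacement. Returns (start_sequence, rules), where
    rules[name] = (left, right) and every rule derives exactly two symbols."""
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so nothing more to gain
        name = f"R{len(rules)}"  # multi-character names cannot clash with single characters
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol, rules):
    """Derive the terminal string generated by one grammar symbol."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


def grammar_size(seq, rules):
    """Total number of symbols on all right-hand sides."""
    return len(seq) + 2 * len(rules)
```

For the input `"abababab"` the sketch produces a start sequence of two symbols and two rules, and expanding the start sequence reproduces the input exactly.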

Patent•

Code, system, and method for generating concepts

Peter Dehlinger, Shao Chin

02 Feb 2005

TL;DR: In this paper, a genetic algorithm is used to find one or more high fitness strings, based on the application of a fitness metric which quantifies, e.g., the number of occurrences of pairs of terms in texts in a selected library of texts.

Abstract: Disclosed are a computer-readable code, system and method for generating candidate novel concepts in one or more selected fields. The system operates to generate strings of terms composed of combinations of word and, optionally, word-group terms that are descriptive of concept elements in such field(s), and uses a genetic algorithm to find one or more high fitness strings, based on the application of a fitness metric which quantifies, e.g., the number of occurrences of pairs of terms in texts in a selected library of texts. The highest-score string or strings are then applied in a database search to identify one or more pairs of primary and secondary texts whose terms overlap with those of a high fitness string.
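
The search described above can be sketched as a toy genetic algorithm. Everything specific here is an assumption of the example rather than a detail from the patent: the library is modelled as a list of word sets, the fitness metric counts the texts in which both terms of a pair occur, and the names `fitness` and `evolve` along with the population size, mutation rate, and selection scheme are hypothetical.

```python
import random
from itertools import combinations


def fitness(terms, library):
    """Count, over all pairs of distinct terms, the texts containing both terms."""
    return sum(
        1
        for a, b in combinations(sorted(set(terms)), 2)
        for text in library
        if a in text and b in text
    )


def evolve(vocabulary, library, length=3, population_size=20, generations=30, seed=0):
    """Return a high-fitness string of `length` terms drawn from `vocabulary`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [rng.sample(vocabulary, length) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the fitter half unchanged and refill the rest with offspring.
        population.sort(key=lambda s: fitness(s, library), reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, length)       # one-point crossover
            child = mother[:cut] + father[cut:]
            if rng.random() < 0.2:               # occasional mutation of one term
                child[rng.randrange(length)] = rng.choice(vocabulary)
            children.append(child)
        population = parents + children
    return max(population, key=lambda s: fitness(s, library))
```

Because repeated terms are collapsed before pairs are counted, a string that repeats a term scores no higher than the same string without the repeat, which nudges the search toward combinations of distinct terms.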

Journal Article•DOI•

Practical methods for constructing suffix trees

Yuanyuan Tian¹, Sandeep Tata¹, Richard A. Hankins², Jignesh M. Patel¹•Institutions (2)

University of Michigan¹, Intel²

01 Sep 2005

TL;DR: This paper presents a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and shows that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.

Abstract: Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized. In this paper, we explore suffix tree construction algorithms over a wide spectrum of data sources and sizes. First, we show that on modern processors, a cache-efficient algorithm with O(n²) worst-case complexity outperforms popular linear time algorithms like Ukkonen and McCreight, even for in-memory construction. For larger datasets, the disk I/O requirement quickly becomes the bottleneck in each algorithm's performance. To address this problem, we describe two approaches. First, we present a buffer management strategy for the O(n²) algorithm. The resulting new algorithm, which we call “Top Down Disk-based” (TDD), scales to sizes much larger than have been previously described in literature. This approach far outperforms the best known disk-based construction methods. Second, we present a new disk-based suffix tree construction algorithm that is based on a sort-merge paradigm, and show that for constructing very large suffix trees with very little resources, this algorithm is more efficient than TDD.
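
As a rough illustration of the kind of queries a suffix tree supports, the sketch below builds an uncompressed suffix trie in memory by inserting every suffix, which takes quadratic time and space. It is not TDD, the sort-merge algorithm, Ukkonen, or McCreight, and it makes no attempt at cache efficiency or disk buffering; the function names and the `$` terminator are assumptions of this example.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    if terminator in text:
        raise ValueError("terminator must not occur in text")
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Exact match query: does `pattern` occur anywhere in the indexed text?"""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def count_occurrences(trie, pattern):
    """Number of occurrences of a non-empty `pattern`: each leaf below the
    node reached by the pattern corresponds to one suffix that starts with it."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return 0
        node = node[ch]
    stack, leaves = [node], 0
    while stack:
        current = stack.pop()
        if not current:
            leaves += 1
        stack.extend(current.values())
    return leaves
```

For the text `"banana"`, `count_occurrences` reports 2 for `"ana"`, counting the two overlapping occurrences, which is a simple instance of the repeat-finding queries the abstract mentions.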

Journal Article•DOI•

A neural syntactic language model

Ahmad Emami¹, Frederick Jelinek¹•Institutions (1)

Johns Hopkins University¹

01 Sep 2005-Machine Learning

TL;DR: The neural syntactic based model achieves the best published results in perplexity and WER for the given data sets and comparisons with the standard and neural net based N-gram models with arbitrarily long contexts show that the syntactic information is in fact very helpful in estimating the word string probability.

Abstract: This paper presents a study of using neural probabilistic models in a syntactic based language model. The neural probabilistic model makes use of a distributed representation of the items in the conditioning history, and is powerful in capturing long dependencies. Employing neural network based models in the syntactic based language model enables it to use efficiently the large amount of information available in a syntactic parse in estimating the next word in a string. Several scenarios of integrating neural networks in the syntactic based language model are presented, accompanied by the derivation of the training procedures involved. Experiments on the UPenn Treebank and the Wall Street Journal corpus show significant improvements in perplexity and word error rate over the baseline SLM. Furthermore, comparisons with the standard and neural net based N-gram models with arbitrarily long contexts show that the syntactic information is in fact very helpful in estimating the word string probability. Overall, our neural syntactic based model achieves the best published results in perplexity and WER for the given data sets.

Patent•

Comprehensive erase verification for non-volatile memory

Dat Tran¹, Kiran Ponnuru¹, Jian Chen¹, Jeffrey W. Lutze¹, Jun Wan¹•Institutions (1)

SanDisk¹

21 Dec 2005

TL;DR: In this article, the results of erasing a NAND string can be verified by testing charging of the string in a plurality of directions with the storage elements biased to turn on if in an erased state.

Abstract: Systems and methods in accordance with various embodiments can provide for comprehensive erase verification and defect detection in non-volatile semiconductor memory. In one embodiment, the results of erasing a group of storage elements are verified using a plurality of test conditions to better detect defective and/or insufficiently erased storage elements of the group. For example, the results of erasing a NAND string can be verified by testing charging of the string in a plurality of directions with the storage elements biased to turn on if in an erased state. If a string of storage elements passes a first test process or operation but fails a second test process or operation, the string can be determined to have failed the erase process and possibly be defective. By testing charging or conduction of the string in a plurality of directions, defects in any transistors of the string that are masked under one set of conditions may be exposed under a second set of bias conditions. For example, a string may pass an erase verification operation but then be read as including one or more programmed storage elements. Such a string can be defective and mapped out of the memory device.
