
String (computer science)

About: String (computer science) is a research topic. Over its lifetime, 19,430 publications have been published within this topic, receiving 333,247 citations. The topic is also known as: str & s.


Papers
Book Chapter
01 Dec 2003
TL;DR: Several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data are introduced, and it is shown that these new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.
Abstract: We introduce several new families of string kernels designed in particular for use with support vector machines (SVMs) for classification of protein sequence data. These kernels (restricted gappy kernels, substitution kernels, and wildcard kernels) are based on feature spaces indexed by k-length subsequences from the string alphabet \(\Sigma\) (or the alphabet augmented by a wildcard character), and hence they are related to the recently presented (k,m)-mismatch kernel and string kernels used in text classification. However, for all kernels we define here, the kernel value \(K(x,y)\) can be computed in \(O(c_K(|x| + |y|))\) time, where the constant \(c_K\) depends on the parameters of the kernel but is independent of the size \(|\Sigma|\) of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant \(c_K = k^{m+1} |\Sigma|^m\) of the mismatch kernel. We compute the kernels efficiently using a recursive function based on a trie data structure and relate our new kernels to the recently described transducer formalism. Finally, we report protein classification experiments on a benchmark SCOP dataset, where we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models.

62 citations
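
As a rough illustration of what a feature space indexed by k-length substrings looks like, the sketch below computes the exact-match k-spectrum kernel, the simplest relative of the kernels named in the abstract above. It is not one of the paper's gappy, substitution, or wildcard kernels, and it uses hashed counts rather than the trie recursion the authors describe; the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(s, k):
    """Count every contiguous length-k substring (k-mer) of s."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def spectrum_kernel(x, y, k):
    """Inner product of the k-mer count vectors of x and y.

    Assumption: exact matching only (no gaps, mismatches, or wildcards).
    """
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    # Iterate over the smaller map; k-mers missing from either string add 0.
    if len(cx) > len(cy):
        cx, cy = cy, cx
    return sum(n * cy[kmer] for kmer, n in cx.items())


if __name__ == "__main__":
    print(spectrum_kernel("ABAB", "BABA", 2))
```

For a fixed k, each string is scanned once, so the cost grows linearly with |x| + |y|, which is the same scaling the abstract claims for its richer kernels.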

Proceedings Article
26 Apr 2007
TL;DR: A novel method for evaluating the output of Machine Translation, based on comparing the dependency structures of the translation and reference rather than their surface string forms, which reaches high correlation with human scores.
Abstract: We present a novel method for evaluating the output of Machine Translation (MT), based on comparing the dependency structures of the translation and reference rather than their surface string forms. Our method uses a treebank-based, wide-coverage, probabilistic Lexical-Functional Grammar (LFG) parser to produce a set of structural dependencies for each translation-reference sentence pair, and then calculates the precision and recall for these dependencies. Our dependency-based evaluation, in contrast to most popular string-based evaluation metrics, will not unfairly penalize perfectly valid syntactic variations in the translation. In addition to allowing for legitimate syntactic differences, we use paraphrases in the evaluation process to account for lexical variation. In comparison with other metrics on 16,800 sentences of Chinese-English newswire text, our method reaches high correlation with human scores. An experiment with two translations of 4,000 sentences from Spanish-English Europarl shows that, in contrast to most other metrics, our method does not display a high bias towards statistical models of translation.

61 citations
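
The precision and recall step mentioned in the abstract above can be sketched as a comparison of two sets of dependencies. This is only a minimal illustration: the triple format (relation, head, dependent) and the function name are assumptions, the dependencies are taken as already parsed, and the paper's parser and paraphrase handling are not modelled.

```python
def dependency_prf(candidate, reference):
    """Precision, recall, and F1 over two collections of dependency triples.

    Assumption: each dependency is a hashable triple such as
    ("subj", "resign", "john"), and duplicates are ignored (set semantics).
    """
    cand, ref = set(candidate), set(reference)
    matched = len(cand & ref)
    precision = matched / len(cand) if cand else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    reference = {("subj", "resign", "john"), ("adjunct", "resign", "yesterday")}
    # A reordered translation yields the same dependencies, so it is not penalized.
    reordered = {("adjunct", "resign", "yesterday"), ("subj", "resign", "john")}
    print(dependency_prf(reordered, reference))
```

Because the comparison is over unordered sets, a translation that reorders words but preserves the same dependencies scores the same as the reference, which is the property the abstract contrasts with string-based metrics.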

Proceedings Article
01 Dec 2013
TL;DR: An attributes-based approach to multi-writer word spotting that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare is proposed.
Abstract: We propose an approach to multi-writer word spotting, where the goal is to find a query word in a dataset comprised of document images. We propose an attributes-based approach that leads to a low-dimensional, fixed-length representation of the word images that is fast to compute and, especially, fast to compare. This approach naturally leads to a unified representation of word images and strings, which seamlessly allows one to perform both query-by-example, where the query is an image, and query-by-string, where the query is a string. We also propose a calibration scheme, based on Canonical Correlation Analysis, to correct the attribute scores; it greatly improves the results on a challenging dataset. We test our approach on two public datasets showing state-of-the-art results.

61 citations

Patent
30 Dec 2009
TL;DR: The form data to be obscured is removed from a form and inserted as a portion of a Uniform Resource Locator (URL) string, and an obfuscation is then applied to that portion of the URL string, thereby obscuring the information for sending on an outbound message.
Abstract: Obscuring form data to be passed in forms that are sent in messages over a communications network. The form data to be obscured is removed from a form and inserted as a portion of a Uniform Resource Locator ("URL") string. The obscured form data may comprise hidden fields and/or links. An obfuscation is then applied to the portion of the URL string, thereby obscuring the information for sending on an outbound message. The original information is recovered from an inbound message which contains the obscured information by reversing the processing used for the obscuring. In one aspect, the obfuscation comprises encryption. In another aspect, the obfuscation comprises creating a tiny URL that replaces the portion of the URL string.

61 citations
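
The tiny-URL aspect described in the patent abstract above might look roughly like the following sketch. Everything here is an assumption made for illustration: the `TokenStore` class, the `t` query parameter, and the in-memory table are hypothetical and do not come from the patent, and the encryption aspect is not shown.

```python
import secrets
from urllib.parse import parse_qsl, urlencode


class TokenStore:
    """Hypothetical in-memory table mapping short tokens to hidden URL portions."""

    def __init__(self):
        self._table = {}

    def obscure(self, base_url, form_fields):
        # Move the form fields into a query-string portion, then replace that
        # portion with a random token that reveals nothing about the fields.
        portion = urlencode(form_fields)
        token = secrets.token_urlsafe(8)
        self._table[token] = portion
        return f"{base_url}?t={token}"

    def recover(self, url):
        # Reverse the processing: look the token up and decode the fields.
        query = url.partition("?")[2]
        token = dict(parse_qsl(query))["t"]
        return dict(parse_qsl(self._table[token], keep_blank_values=True))


if __name__ == "__main__":
    store = TokenStore()
    url = store.obscure("https://example.com/submit", {"user.id": "a.b.c"})
    print(url)
    print(store.recover(url))
```

The outbound URL carries only the token, so the hidden fields never appear in the message; the receiving side needs access to the same table to recover them.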

Proceedings Article
03 Nov 1993
TL;DR: An algorithm that computes a deterministic sample of a sufficiently long substring in constant time for string matching, solving the main open problem remaining in string matching.
Abstract: All algorithms below are optimal alphabet-independent parallel CRCW PRAM algorithms. In one dimension: Given a pattern string of length m for the string-matching problem, we design an algorithm that computes a deterministic sample of a sufficiently long substring in constant time. This problem used to be a bottleneck in the pattern preprocessing for one- and two-dimensional pattern matching. The best previous time bound was \(O(\log^2 m / \log\log m)\). We use this algorithm to obtain the following results. 1. Improving the preprocessing of the constant-time text search algorithm from \(O(\log^2 m / \log\log m)\) to \(O(\log\log m)\), which is now best possible. 2. A constant-time deterministic string-matching algorithm in the case that the text length n satisfies \(n = \Omega(m^{1+\epsilon})\) for a constant \(\epsilon > 0\). 3. A simple probabilistic string-matching algorithm that has constant time with high probability for random input. 4. A constant expected time Las-Vegas algorithm for computing the period of the pattern and all witnesses and thus string matching itself, solving the main open problem remaining in string matching.

61 citations
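
To make the terms "period" and "witness" in the abstract above concrete, here is a small sequential sketch. It is not the parallel CRCW PRAM algorithm from the paper: it uses the standard prefix function (as in Knuth-Morris-Pratt) for the period and a plain scan for witnesses, and the function names are chosen for illustration only.

```python
def prefix_function(p):
    """pi[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    pi = [0] * len(p)
    for i in range(1, len(p)):
        j = pi[i - 1]
        while j > 0 and p[i] != p[j]:
            j = pi[j - 1]
        if p[i] == p[j]:
            j += 1
        pi[i] = j
    return pi


def period(p):
    """Smallest d >= 1 such that p[i] == p[i + d] for every valid i (0 for empty p)."""
    return len(p) - prefix_function(p)[-1] if p else 0


def witness(p, d):
    """Return a position i with p[i] != p[i + d], or None if shift d is a period of p."""
    for i in range(len(p) - d):
        if p[i] != p[i + d]:
            return i
    return None


if __name__ == "__main__":
    pattern = "abcabcab"
    print(period(pattern))
    print([witness(pattern, d) for d in range(1, period(pattern))])
```

A witness for shift d proves that the pattern cannot match at two text positions d apart, which is why shifts smaller than the period can be ruled out quickly during matching.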


Network Information
Related Topics (5)
Time complexity: 36K papers, 879.5K citations, 88% related
Tree (data structure): 44.9K papers, 749.6K citations, 86% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 85% related
Computational complexity theory: 30.8K papers, 711.2K citations, 82% related
Supervised learning: 20.8K papers, 710.5K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    2
2021    491
2020    704
2019    759
2018    816
2017    806