Topic

String (computer science)

About: String (computer science) is a research topic. Over its lifetime, 19,430 publications have been published within this topic, receiving 333,247 citations. The topic is also known as: str & s.


Papers
Proceedings Article
13 May 2013
TL;DR: This paper presents three different trie-based data structures to address the case where the string set is so large that compression is needed to fit the data structure in memory, and shows that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data.
Abstract: Virtually every modern search application, either desktop, web, or mobile, features some kind of query auto-completion. In its basic form, the problem consists in retrieving from a string set a small number of completions, i.e. strings beginning with a given prefix, that have the highest scores according to some static ranking. In this paper, we focus on the case where the string set is so large that compression is needed to fit the data structure in memory. This is a compelling case for web search engines and social networks, where it is necessary to index hundreds of millions of distinct queries to guarantee a reasonable coverage; and for mobile devices, where the amount of memory is limited. We present three different trie-based data structures to address this problem, each one with different space/time/complexity trade-offs. Experiments on large-scale datasets show that it is possible to compress the string sets, including the scores, down to spaces competitive with the gzip'ed data, while supporting efficient retrieval of completions at about a microsecond per completion.

67 citations
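
The top-k completion problem described in the abstract above can be illustrated with a plain, uncompressed trie. The sketch below is not one of the paper's three compressed data structures; it only shows the basic retrieval idea, assuming distinct strings with static numeric scores. The names `TrieNode`, `build_trie`, and `top_k` are hypothetical, and storing the best score per subtree is an assumption made here to allow a best-first search.

```python
import heapq
from itertools import count


class TrieNode:
    def __init__(self):
        self.children = {}               # char -> TrieNode
        self.score = None                # score if a string ends here, else None
        self.max_score = float("-inf")   # best score found anywhere in this subtree


def build_trie(scored_strings):
    """Build a trie from (string, score) pairs; strings are assumed distinct."""
    root = TrieNode()
    for text, score in scored_strings:
        node = root
        node.max_score = max(node.max_score, score)
        for ch in text:
            node = node.children.setdefault(ch, TrieNode())
            node.max_score = max(node.max_score, score)
        node.score = score
    return root


def top_k(root, prefix, k):
    """Return up to k (string, score) pairs starting with prefix, best score first."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    tie = count()  # unique tiebreaker so the heap never compares nodes
    # Heap entries: (-priority, tiebreak, text, node, is_finished_string)
    heap = [(-node.max_score, next(tie), prefix, node, False)]
    results = []
    while heap and len(results) < k:
        neg, _, text, cur, finished = heapq.heappop(heap)
        if finished:
            results.append((text, -neg))
            continue
        if cur.score is not None:
            # The string ending at this node competes with its own subtrees.
            heapq.heappush(heap, (-cur.score, next(tie), text, cur, True))
        for ch, child in cur.children.items():
            heapq.heappush(heap, (-child.max_score, next(tie), text + ch, child, False))
    return results


queries = [("car", 5), ("cat", 9), ("cart", 7), ("dog", 3), ("ca", 1)]
trie = build_trie(queries)
print(top_k(trie, "ca", 2))
```

Because each subtree is expanded only when its best score reaches the front of the heap, the search touches few nodes for small k. A dictionary-per-node trie like this one is memory hungry, which is the cost the compressed structures in the paper are designed to avoid.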

Proceedings Article
23 Jan 2011
TL;DR: A comprehensive set of algorithms and data structures for performing fast automata operations for string constraint solving is studied to provide an apples-to-apples comparison between techniques that are used in current tools.
Abstract: There has been significant recent interest in automated reasoning techniques, in particular constraint solvers, for string variables. These techniques support a wide variety of clients, ranging from static analysis to automated testing. The majority of string constraint solvers rely on finite automata to support regular expression constraints. For these approaches, performance depends critically on fast automata operations such as intersection, complementation, and determinization. Existing work in this area has not yet provided conclusive results as to which core algorithms and data structures work best in practice. In this paper, we study a comprehensive set of algorithms and data structures for performing fast automata operations. Our goal is to provide an apples-to-apples comparison between techniques that are used in current tools. To achieve this, we re-implemented a number of existing techniques. We use an established set of regular expressions benchmarks as an indicative workload. We also include several techniques that, to the best of our knowledge, have not yet been used for string constraint solving. Our results show that there is a substantial performance difference across techniques, which has implications for future tool design.

67 citations
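
Of the automata operations named in the abstract above, intersection is the easiest to show in a few lines. The following is a minimal sketch of the textbook product construction for deterministic automata, not any of the implementations compared in the paper. The DFA representation (a tuple of start state, accepting set, and a transition dictionary) and the names `intersect` and `accepts` are assumptions chosen for illustration; a missing transition is treated as rejection.

```python
from collections import deque


def _outgoing(delta):
    """Group transitions {(state, symbol): target} by source state."""
    out = {}
    for (state, sym), target in delta.items():
        out.setdefault(state, {})[sym] = target
    return out


def intersect(dfa_a, dfa_b):
    """Product construction: the result accepts exactly the strings both inputs accept."""
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    out_a, out_b = _outgoing(delta_a), _outgoing(delta_b)

    start = (start_a, start_b)
    accept, delta = set(), {}
    seen, queue = {start}, deque([start])
    while queue:  # explore only product states reachable from the start pair
        a, b = queue.popleft()
        if a in accept_a and b in accept_b:
            accept.add((a, b))
        moves_b = out_b.get(b, {})
        for sym, target_a in out_a.get(a, {}).items():
            if sym not in moves_b:
                continue  # one side has no move on sym, so the product has none
            nxt = (target_a, moves_b[sym])
            delta[((a, b), sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, accept, delta


def accepts(dfa, text):
    """Run the DFA on text; a missing transition means rejection."""
    state, accept, delta = dfa
    for ch in text:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in accept


# Example automata over {a, b}: an even number of 'a', and strings ending in 'b'.
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
ends_b = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})
both = intersect(even_a, ends_b)
print(accepts(both, "aab"), accepts(both, "ab"))
```

Building only the reachable pairs keeps the product smaller than the full cross product of state sets, which is one of the design choices that can affect performance in practice.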

Patent
18 Sep 2002
TL;DR: In this paper, a method and apparatus for analyzing documents and determining the association between words in a language is presented, which includes providing a collection of documents (306), selecting a first word or word strings, and a second word or string occurring in the documents.
Abstract: A method and apparatus for analyzing documents and thereby determining the association between words in a language (Fig. 3). The method includes providing a collection of documents (306), selecting a first word or word strings, and a second word or word string occurring in the documents. The method further involves associating first word or word strings and second word or word strings with common word or word string based on frequency of occurrence of the common word or word strings within the ranges (304).

67 citations

Journal Article
TL;DR: It is shown that ρ(n)≤n and there are at most 0.67n runs with periods larger than 87, which supports the conjecture that the number of all runs is smaller than n.
Abstract: A run in a string is a nonextendable (with the same minimal period) periodic segment in a string. The set of runs corresponds to the structure of internal periodicities in a string. Periodicities in strings were extensively studied and are important both in theory and practice (combinatorics of words, pattern-matching, computational biology). Let ρ(n) be the maximal number of runs in a string of length n. It has been shown that ρ(n)=O(n), but the proof was very complicated and the constant coefficient in O(n) has not been given explicitly. We demystify the proof of the linear upper bound for ρ(n) and propose a new approach to the analysis of runs based on the properties of subperiods: the periods of periodic parts of the runs. We show that ρ(n)≤n and there are at most 0.67n runs with periods larger than 87. This supports the conjecture that the number of all runs is smaller than n. We also give a completely new proof of the linear bound and discover several new interesting "periodicity lemmas".

67 citations
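
To make the definition of a run in the abstract above concrete, here is a brute-force enumeration. It is only an illustrative sketch and not the paper's analysis or an efficient algorithm. It assumes the usual reading of the definition: a run is a segment with period p and length at least 2p that cannot be extended to the left or right with the same period, reported with its minimal period. The function name `find_runs` is hypothetical.

```python
def find_runs(s):
    """Return the runs of s as sorted (start, end, minimal_period) triples, end inclusive.

    Brute force: for each candidate period p, scan for maximal blocks of
    positions k with s[k] == s[k + p].
    """
    n = len(s)
    runs = {}  # (start, end) -> smallest period found for that segment
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] != s[k + p]:
                k += 1
                continue
            start = k
            while k + p < n and s[k] == s[k + p]:
                k += 1
            # s[start .. k + p - 1] has period p and cannot be extended.
            # Its length is (k - start) + p, so it is a run when k - start >= p.
            if k - start >= p:
                # Periods are tried in increasing order, so the first period
                # recorded for a segment is its minimal one.
                runs.setdefault((start, k + p - 1), p)
    return sorted((i, j, p) for (i, j), p in runs.items())


for word in ["aaaa", "abab", "mississippi"]:
    print(word, find_runs(word))
```

For "mississippi" this reports four runs: "ississi" with period 3 and the three squares "ss", "ss", and "pp" with period 1.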

Patent
Fuliang Weng1, Lin Zhao1
16 Jun 2005
TL;DR: A method of proper name recognition includes classifying each word of a word string with a tag indicating a proper name entity category or a non-named entity category, and correcting the tag of a boundary word of the word string.
Abstract: A method of proper name recognition includes classifying each word of a word string with a tag indicating a proper name entity category or a non-named entity category, and correcting the tag of a boundary word of the word string.

67 citations


Network Information
Related Topics (5)
Time complexity: 36K papers, 879.5K citations, 88% related
Tree (data structure): 44.9K papers, 749.6K citations, 86% related
Graph (abstract data type): 69.9K papers, 1.2M citations, 85% related
Computational complexity theory: 30.8K papers, 711.2K citations, 82% related
Supervised learning: 20.8K papers, 710.5K citations, 80% related
Performance
Metrics
No. of papers in the topic in previous years
Year    Papers
2022    2
2021    491
2020    704
2019    759
2018    816
2017    806