Author

Daniel S. Hirschberg

Bio: Daniel S. Hirschberg is an academic researcher from the University of California, Irvine. The author has contributed to research in topics: Longest increasing subsequence & Longest common subsequence problem. The author has an h-index of 30 and has co-authored 92 publications receiving 5323 citations. Previous affiliations of Daniel S. Hirschberg include Princeton University & the French Institute for Research in Computer Science and Automation.


Papers
Journal ArticleDOI
TL;DR: The problem of finding a longest common subsequence of two strings has previously been solved in quadratic time and space; an algorithm is presented which solves the problem in quadratic time and linear space.
Abstract: The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. An algorithm is presented which will solve this problem in quadratic time and in linear space.
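The technique described here is now usually called Hirschberg's algorithm. As a rough illustration only (my sketch, not the paper's own presentation), the following Python code recovers an LCS in linear space by scoring the left half of one string forwards and the right half backwards, then splitting the other string where the two scores combine best.

def lcs_lengths(a, b):
    """Last row of the LCS-length DP table for a vs. b, using O(len(b)) space."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev

def hirschberg_lcs(a, b):
    """Recover one longest common subsequence of a and b in linear space."""
    if not a or not b:
        return ""
    if len(a) == 1:
        return a if a in b else ""
    mid = len(a) // 2
    # Score the left half forwards and the right half backwards,
    # then split b where the two halves combine best.
    left = lcs_lengths(a[:mid], b)
    right = lcs_lengths(a[mid:][::-1], b[::-1])
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    return hirschberg_lcs(a[:mid], b[:split]) + hirschberg_lcs(a[mid:], b[split:])

print(hirschberg_lcs("computer science", "course"))  # prints "course"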

1,164 citations

Journal ArticleDOI
TL;DR: An algorithm is presented that is applicable in the general case and requires O(pn + n log n) time for input strings of lengths m and n, even though the lower bound of O(mn) time need not apply to all inputs.
Abstract: We start by defining conventions and terminology that will be used throughout this paper. String C = c1c2...cp is a subsequence of string A = a1a2...am if there is a mapping F: {1, 2, ..., p} → {1, 2, ..., m} such that F(i) = k only if ci = ak and F is a monotone strictly increasing function (i.e. F(i) = u, F(j) = v, and i < j imply that u < v). C can be formed by deleting m - p (not necessarily adjacent) symbols from A. For example, "course" is a subsequence of "computer science." String C is a common subsequence of strings A and B if C is a subsequence of A and also a subsequence of B. String C is a longest common subsequence (abbreviated LCS) of strings A and B if C is a common subsequence of A and B of maximal length, i.e. there is no common subsequence of A and B that has greater length. Throughout this paper, we assume that A and B are strings of lengths m and n, m ≤ n, that have an LCS C of (unknown) length p. We assume that the symbols that may appear in these strings come from some alphabet of size t. A symbol can be stored in memory by using log t bits, which we assume will fit in one word of memory. Symbols can be compared (a ≤ b?) in one time unit. The number of different symbols that actually appear in string B is defined to be s (which must be less than n and t). The longest common subsequence problem has been solved by using a recursion relationship on the length of the solution [7, 12, 16, 21]. These are generally applicable algorithms that take O(mn) time for any input strings of lengths m and n, even though the lower bound on time of O(mn) need not apply to all inputs [2]. We present algorithms that, depending on the nature of the input, may not require quadratic time to recover an LCS. The first algorithm is applicable in the general case and requires O(pn + n log n) time. The second algorithm requires time bounded by O((m + 1 - p)p log n). In the common special case where p is close to m, this algorithm takes time ...
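To make the definitions above concrete, here is a small Python sketch (mine, not the paper's) that checks the subsequence relation by greedily looking for a strictly increasing mapping of positions, using the paper's "course" / "computer science" example.

def is_subsequence(c, a):
    """True if c can be formed by deleting (not necessarily adjacent) symbols from a,
    i.e. there is a strictly increasing mapping of positions of c into positions of a."""
    it = iter(a)
    return all(symbol in it for symbol in c)

def is_common_subsequence(c, a, b):
    """True if c is a subsequence of both a and b."""
    return is_subsequence(c, a) and is_subsequence(c, b)

print(is_subsequence("course", "computer science"))                      # True, the paper's example
print(is_common_subsequence("core", "computer science", "course work"))  # True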

799 citations

Journal ArticleDOI
TL;DR: A variety of data compression methods are surveyed, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986; data compression has important applications in the areas of file storage and distributed systems.
Abstract: This paper surveys a variety of data compression methods spanning almost 40 years of research, from the work of Shannon, Fano, and Huffman in the late 1940s to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory as they relate to the goals and evaluation of data compression methods are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported, and possibilities for future research are suggested.
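As one concrete instance of the methods such a survey covers, the following minimal Python sketch builds a Huffman code (an illustration of the general technique only, not code from the paper): the two least frequent subtrees are repeatedly merged, and each merge prefixes one subtree's codewords with 0 and the other's with 1.

import heapq
from collections import Counter

def huffman_code(text):
    """Build a Huffman code (symbol -> bit string) for the symbols in text."""
    # Each heap entry: (frequency, tie_breaker, {symbol: code_so_far})
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prefix the codes of the two cheapest subtrees with 0 and 1, then merge them.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "data compression reduces redundancy"
code = huffman_code(text)
encoded = "".join(code[ch] for ch in text)
print(len(encoded), "bits instead of", 8 * len(text))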

581 citations

Journal ArticleDOI
TL;DR: It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings.
Abstract: The problem of finding a longest common subsequence of two strings is discussed. This problem arises in data processing applications such as comparing two files and in genetic applications such as studying molecular evolution. The difficulty of computing a longest common subsequence of two strings is examined using the decision tree model of computation, in which vertices represent "equal-unequal" comparisons. It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings. A general lower bound as a function of the ratio of alphabet size to string length is derived. The case where comparisons between symbols of the same string are forbidden is also considered and it is shown that this problem is of linear complexity for a two-symbol alphabet and quadratic for an alphabet of three or more symbols.
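The quadratic behaviour referred to here is easy to see in the straightforward dynamic program, which performs one "equal-unequal" comparison for every pair of positions. A small Python sketch (for illustration only, not from the paper) that counts those comparisons:

def lcs_length_counting(a, b):
    """Standard LCS dynamic program; also counts equal/unequal symbol comparisons."""
    comparisons = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            comparisons += 1  # one "equal-unequal" comparison per position pair
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1], comparisons

length, comps = lcs_length_counting("abcbdab", "bdcaba")
print(length, comps)  # LCS length 4, 7 * 6 = 42 comparisons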

273 citations

Journal ArticleDOI
TL;DR: A parallel algorithm is presented which uses n² processors to find the connected components of an undirected graph with n vertices in O(log² n) time; the algorithm can also be used to find the transitive closure of a symmetric Boolean matrix.
Abstract: We present a parallel algorithm which uses n² processors to find the connected components of an undirected graph with n vertices in time O(log² n). An O(log² n) time bound also can be achieved using only n⌈n/⌈log₂ n⌉⌉ processors. The algorithm can be used to find the transitive closure of a symmetric Boolean matrix. We assume that the processors have access to a common memory. Simultaneous access to the same location is permitted for fetch instructions but not for store instructions.
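The following is a sequential Python sketch of the hook-and-shortcut idea underlying such parallel connectivity algorithms; it is an illustration only, not the paper's PRAM algorithm, in which each vertex and edge would be handled by its own processor in every round.

def connected_components(n, edges):
    """Sequential sketch of the hook-and-jump idea behind parallel connectivity:
    every vertex keeps a parent pointer; repeatedly hook components together
    along edges and then shortcut (pointer-jump) until nothing changes."""
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: point the larger of the two endpoint labels at the smaller one.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] > lo:
                    parent[hi] = lo
                    changed = True
        # Pointer jumping: shortcut every pointer toward the root.
        for v in range(n):
            parent[v] = parent[parent[v]]
    return parent  # parent[v] is the smallest-numbered vertex in v's component

print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))  # [0, 0, 0, 3, 3, 5]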

266 citations


Cited by
Book
08 Sep 2000
TL;DR: This book presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects, and provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.
Abstract: The increasing volume of data in modern business and science calls for more complex and sophisticated tools. Although advances in data mining technology have made extensive data collection much easier, the field is still evolving and there is a constant need for new techniques and tools that can help us transform this data into useful information and knowledge. Since the previous edition's publication, great advances have been made in the field of data mining. Not only does the third edition of Data Mining: Concepts and Techniques continue the tradition of equipping you with an understanding and application of the theory and practice of discovering patterns hidden in large data sets, it also focuses on new, important topics in the field: data warehouses and data cube technology, mining data streams, mining social networks, and mining spatial, multimedia and other complex data. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. This is the resource you need if you want to apply today's most powerful data mining techniques to meet real business challenges.
* Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects.
* Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields.
* Provides a comprehensive, practical look at the concepts and techniques you need to get the most out of real business data.

23,600 citations

Journal ArticleDOI
18 Oct 2016 - PeerJ
TL;DR: VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging and dereplication.
Abstract: Background: VSEARCH is an open source and free of charge multithreaded 64-bit tool for processing and preparing metagenomics, genomics and population genomics nucleotide sequence data. It is designed as an alternative to the widely used USEARCH tool (Edgar, 2010) for which the source code is not publicly available, algorithm details are only rudimentarily described, and only a memory-confined 32-bit version is freely available for academic use. Methods: When searching nucleotide sequences, VSEARCH uses a fast heuristic based on words shared by the query and target sequences in order to quickly identify similar sequences; a similar strategy is probably used in USEARCH. VSEARCH then performs optimal global sequence alignment of the query against potential target sequences, using full dynamic programming instead of the seed-and-extend heuristic used by USEARCH. Pairwise alignments are computed in parallel using vectorisation and multiple threads. Results: VSEARCH includes most commands for analysing nucleotide sequences available in USEARCH version 7 and several of those available in USEARCH version 8, including searching (exact or based on global alignment), clustering by similarity (using length pre-sorting, abundance pre-sorting or a user-defined order), chimera detection (reference-based or de novo), dereplication (full length or prefix), pairwise alignment, reverse complementation, sorting, and subsampling. VSEARCH also includes commands for FASTQ file processing, i.e., format detection, filtering, read quality statistics, and merging of paired reads. Furthermore, VSEARCH extends functionality with several new commands and improvements, including shuffling, rereplication, masking of low-complexity sequences with the well-known DUST algorithm, a choice among different similarity definitions, and FASTQ file format conversion. VSEARCH is here shown to be more accurate than USEARCH when performing searching, clustering, chimera detection and subsampling, while on a par with USEARCH for paired-end read merging. VSEARCH is slower than USEARCH when performing clustering and chimera detection, but significantly faster when performing paired-end read merging and dereplication. VSEARCH is available at https://github.com/torognes/vsearch under either the BSD 2-clause license or the GNU General Public License version 3.0. Discussion: VSEARCH has been shown to be a fast, accurate and full-fledged alternative to USEARCH. A free and open-source versatile tool for sequence analysis is now available to the metagenomics community.
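As a toy illustration of the shared-word prefilter idea described above (not VSEARCH's actual implementation; the k-mer length, scoring, and sequences below are made up), candidate targets can be ranked by how many k-length words they share with the query before any full alignment is attempted:

from collections import Counter

def kmers(seq, k=8):
    """Multiset of overlapping k-length words in a nucleotide sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def rank_targets(query, targets, k=8):
    """Rank target sequences by the number of k-mer words they share with the query.
    A real tool would then run full alignment only on the top-ranked candidates."""
    qwords = kmers(query, k)
    scores = {name: sum((qwords & kmers(seq, k)).values()) for name, seq in targets.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

targets = {
    "t1": "ACGTACGTGGCCTTAAGGCCACGTACGT",
    "t2": "TTTTTTTTTTTTTTTTTTTTTTTTTTTT",
}
print(rank_targets("ACGTACGTGGCCTTAAGG", targets, k=8))  # t1 ranks first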

5,850 citations

Journal ArticleDOI
TL;DR: The third generation of the CAP sequence assembly program is described; it can clip 5' and 3' low-quality regions of reads and uses forward-reverse constraints to correct assembly errors and link contigs.
Abstract: The shotgun sequencing strategy has been used widely in genome sequencing projects. A major phase in this strategy is to assemble short reads into long sequences. A number of DNA sequence assembly programs have been developed (Staden 1980; Peltola et al. 1984; Huang 1992; Smith et al. 1993; Gleizes and Henaut 1994; Lawrence et al. 1994; Kececioglu and Myers 1995; Sutton et al. 1995; Green 1996). The FAKII program provides a library of routines for each phase of the assembly process (Larson et al. 1996). The GAP4 program has a number of useful interactive features (Bonfield et al. 1995). The PHRAP program clips 5′ and 3′ low-quality regions of reads and uses base quality values in evaluation of overlaps and generation of contig sequences (Green 1996). TIGR Assembler has been used in a number of megabase microbial genome projects (Sutton et al. 1995). Continued development and improvement of sequence assembly programs are required to meet the challenges of the human, mouse, and maize genome projects. We have developed the third generation of the CAP sequence assembly program (Huang 1992). The CAP3 program includes a number of improvements and new features. A capability to clip 5′ and 3′ low-quality regions of reads is included in the CAP3 program. Base quality values produced by PHRED (Ewing et al. 1998) are used in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. Efficient algorithms are employed to identify and compute overlaps between reads. Forward–reverse constraints are used to correct assembly errors and link contigs. Results of CAP3 on four BAC data sets are presented. The performance of CAP3 was compared with that of PHRAP on a number of BAC data sets. PHRAP often produces longer contigs than CAP3 whereas CAP3 often produces fewer errors in consensus sequences than PHRAP. It is easier to construct scaffolds with CAP3 than with PHRAP on low-pass data with forward–reverse constraints. An unusual feature of CAP3 is the use of forward–reverse constraints in the construction of contigs. A forward–reverse constraint is often produced by sequencing of both ends of a subclone. A forward–reverse constraint specifies that the two reads should be on the opposite strands of the DNA molecule within a specified range of distance. By sequencing both ends of each subclone, a large number of forward–reverse constraints are produced for a cosmid or BAC data set. A difficulty with use of forward–reverse constraints in assembly is that some of the forward–reverse constraints are incorrect because of errors in lane tracking and cloning. Our strategy for dealing with this difficulty is based on the observation that a majority of the constraints are correct and wrong constraints usually occur randomly. Thus, a few unsatisfied constraints in a contig may not be sufficient to indicate an assembly error in the contig. However, if a sufficient number of constraints are all inconsistent with a join in a contig and all support an alternative join, it is likely that the current join is an error, and the alternative join should be made.
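The constraint-checking logic can be illustrated with a short Python sketch; the data structures and field names below are invented for illustration and are not CAP3's internals. A join is flagged only when enough constraints contradict it, since a minority of constraints are themselves erroneous.

def constraint_satisfied(layout, c):
    """A forward-reverse constraint says two reads come from opposite strands of one
    subclone and should lie within a distance range. layout maps read -> (contig,
    position, strand); this structure is hypothetical, for illustration only."""
    a, b = layout.get(c["read1"]), layout.get(c["read2"])
    if a is None or b is None or a[0] != b[0]:
        return False                      # a read is missing or placed in a different contig
    if a[2] == b[2]:
        return False                      # same strand: orientation requirement violated
    return c["min_dist"] <= abs(a[1] - b[1]) <= c["max_dist"]

def join_looks_wrong(layout, constraints, threshold=5):
    """Flag an assembly join only if many constraints are violated; a handful of
    violations is tolerated because some constraints are simply wrong."""
    violated = sum(not constraint_satisfied(layout, c) for c in constraints)
    return violated >= threshold

layout = {"r1.f": ("contig1", 100, "+"), "r1.r": ("contig1", 1900, "-")}
constraints = [{"read1": "r1.f", "read2": "r1.r", "min_dist": 1000, "max_dist": 3000}]
print(join_looks_wrong(layout, constraints))  # False: the single constraint is satisfied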

5,074 citations

Book
01 Jan 1996
TL;DR: This book familiarizes readers with important problems, algorithms, and impossibility results in the area, and teaches readers how to reason carefully about distributed algorithms: to model them formally, devise precise specifications for their required behavior, prove their correctness, and evaluate their performance with realistic measures.
Abstract: In Distributed Algorithms, Nancy Lynch provides a blueprint for designing, implementing, and analyzing distributed algorithms. She directs her book at a wide audience, including students, programmers, system designers, and researchers. Distributed Algorithms contains the most significant algorithms and impossibility results in the area, all in a simple automata-theoretic setting. The algorithms are proved correct, and their complexity is analyzed according to precisely defined complexity measures. The problems covered include resource allocation, communication, consensus among distributed processes, data consistency, deadlock detection, leader election, global snapshots, and many others. The material is organized according to the system model: first by the timing model and then by the interprocess communication mechanism. The material on system models is isolated in separate chapters for easy reference. The presentation is completely rigorous, yet is intuitive enough for immediate comprehension. This book familiarizes readers with important problems, algorithms, and impossibility results in the area: readers can then recognize the problems when they arise in practice, apply the algorithms to solve them, and use the impossibility results to determine whether problems are unsolvable. The book also provides readers with the basic mathematical tools for designing new algorithms and proving new impossibility results. In addition, it teaches readers how to reason carefully about distributed algorithms: to model them formally, devise precise specifications for their required behavior, prove their correctness, and evaluate their performance with realistic measures.
Table of Contents: 1 Introduction; 2 Modelling I: Synchronous Network Model; 3 Leader Election in a Synchronous Ring; 4 Algorithms in General Synchronous Networks; 5 Distributed Consensus with Link Failures; 6 Distributed Consensus with Process Failures; 7 More Consensus Problems; 8 Modelling II: Asynchronous System Model; 9 Modelling III: Asynchronous Shared Memory Model; 10 Mutual Exclusion; 11 Resource Allocation; 12 Consensus; 13 Atomic Objects; 14 Modelling IV: Asynchronous Network Model; 15 Basic Asynchronous Network Algorithms; 16 Synchronizers; 17 Shared Memory versus Networks; 18 Logical Time; 19 Global Snapshots and Stable Properties; 20 Network Resource Allocation; 21 Asynchronous Networks with Process Failures; 22 Data Link Protocols; 23 Partially Synchronous System Models; 24 Mutual Exclusion with Partial Synchrony; 25 Consensus with Partial Synchrony
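As a flavour of the kind of algorithm the book analyzes, here is a round-based Python simulation in the spirit of LCR-style leader election on a synchronous unidirectional ring; the simulation harness is mine and is not the book's pseudocode.

def lcr_leader_election(uids):
    """Round-based simulation of leader election on a synchronous unidirectional ring.
    Every process forwards the largest UID it has seen; a process that receives its
    own UID back holds the ring maximum and declares itself the leader."""
    n = len(uids)
    outgoing = list(uids)                      # message each process sends this round
    leader = None
    for _ in range(n):                         # n synchronous rounds are enough
        incoming = [outgoing[(i - 1) % n] for i in range(n)]   # from the left neighbour
        for i, uid in enumerate(incoming):
            if uid == uids[i]:
                leader = uids[i]               # its own UID travelled all the way around
            outgoing[i] = max(uid, outgoing[i])
    return leader

print(lcr_leader_election([3, 7, 2, 9, 5]))    # 9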

4,340 citations

Journal ArticleDOI
TL;DR: ECFPs can be very rapidly calculated and can represent an essentially infinite number of different molecular features, but a description of their implementation has not previously been presented in the literature.
Abstract: Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure-activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.
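A toy Python sketch of the circular-fingerprint idea (the molecule encoding, initial invariants, and hashing details below are invented for illustration and are not the published ECFP algorithm): each atom identifier is repeatedly re-hashed together with its neighbours' identifiers, so identifiers from later iterations describe larger circular substructures.

from hashlib import sha256

def circular_fingerprint(atoms, bonds, radius=2):
    """Toy ECFP-style fingerprint: every atom starts from a simple invariant, then on
    each iteration its identifier is re-hashed together with its neighbours' identifiers.
    The fingerprint is the set of identifiers collected over all iterations."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def h(value):
        return int.from_bytes(sha256(value.encode()).digest()[:4], "big")

    ids = [h(sym) for sym in atoms]          # iteration 0: atom element symbol only
    features = set(ids)
    for _ in range(radius):
        ids = [h(str((ids[i], tuple(sorted(ids[j] for j in neighbours[i])))))
               for i in range(len(atoms))]
        features.update(ids)                 # keep identifiers from every iteration
    return features

# Ethanol as a crude graph: C-C-O with explicit heavy atoms only (hypothetical encoding).
print(sorted(circular_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])))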

4,173 citations