Author

Ray Smith

Bio: Ray Smith is a researcher at Google. The author has contributed to research on the topics of optical character recognition and Tesseract. The author has an h-index of 15 and has co-authored 26 publications receiving 1,998 citations.

Papers

Proceedings Article

An Overview of the Tesseract OCR Engine

Ray Smith¹•Institutions (1)

Google¹

23 Sep 2007

TL;DR: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview.

Abstract: The Tesseract OCR engine, as was the HP Research Prototype in the UNLV Fourth Annual Test of OCR Accuracy, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, including in particular the line finding, features/classification methods, and the adaptive classifier.

1,530 citations

Proceedings Article

Table detection in heterogeneous documents

Faisal Shafait¹, Ray Smith²•Institutions (2)

German Research Centre for Artificial Intelligence¹, Google²

09 Jun 2010

TL;DR: Evaluation of the algorithm on document images from publicly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system.

Abstract: Detecting tables in document images is important since not only do tables contain important information, but also most of the layout analysis methods fail in the presence of tables in the document image. Existing approaches for table detection mainly focus on detecting tables in single columns of text and do not work reliably on documents with varying layouts. This paper presents a practical algorithm for table detection that works with a high accuracy on documents with varying layouts (company reports, newspaper articles, magazine pages, ...). An open source implementation of the algorithm is provided as part of the Tesseract OCR engine. Evaluation of the algorithm on document images from publicly available UNLV dataset shows competitive performance in comparison to the table detection module of a commercial OCR system.

122 citations

Proceedings Article

Adapting the Tesseract open source OCR engine for multilingual OCR

Ray Smith¹, Daria Antonova¹, Dar-Shyang Lee¹•Institutions (1)

Google¹

25 Jul 2009

TL;DR: Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text.

Abstract: We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has been concentrated on enabling generic multi-lingual operation such that negligible customization is required for a new language beyond providing a corpus of text. Although change was required to various modules, including physical layout analysis, and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.

117 citations

Proceedings Article

Hybrid Page Layout Analysis via Tab-Stop Detection

Ray Smith¹•Institutions (1)

Google¹

26 Jul 2009

TL;DR: A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted.

Abstract: A new hybrid page layout analysis algorithm is proposed, which uses bottom-up methods to form an initial data-type hypothesis and locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading-order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine at http://code.google.com/p/tesseract-ocr.

104 citations

Patent

Shape clustering in post optical character recognition processing

Luc Vincent¹, Ray Smith¹•Institutions (1)

Google¹

15 Jul 2011

TL;DR: Techniques are presented for shape clustering and its applications in processing various documents, including the output of an optical character recognition (OCR) process.

Abstract: Techniques for shape clustering and applications in processing various documents, including an output of an optical character recognition (OCR) process.

91 citations

Cited by

Journal Article

Quantitative analysis of culture using millions of digitized books

Jean-Baptiste Michel¹, Yuan Kui Shen², Yuan Kui Shen¹, Aviva Presser Aiden¹, Adrian Veres¹, Matthew K. Gray³, Joseph P. Pickett, Dale Hoiberg, Dan Clancy³, Peter Norvig³, Jon Orwant³, Steven Pinker¹, Martin A. Nowak¹, Erez Lieberman Aiden•Institutions (3)

Harvard University¹, Massachusetts Institute of Technology², Google³

14 Jan 2011-Science

TL;DR: This work surveys the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000, and shows how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology and the pursuit of fame.

Abstract: We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of 'culturomics,' focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.

2,257 citations

Patent

Display screen or portion thereof with graphical user interface

Jang-Won Seo, Yong-Hwan Kwon, Jieun Kim, Jihong Kim, Hyeryung Kim, Seran Jeon, Woo-seok Hwang

14 Jun 2016

TL;DR: Newness and distinctiveness is claimed in the features of ornamentation as shown inside the broken line circle in the accompanying representation.

Abstract: Newness and distinctiveness is claimed in the features of ornamentation as shown inside the broken line circle in the accompanying representation.

1,500 citations

Proceedings Article

Outside the Closed World: On Using Machine Learning for Network Intrusion Detection

Robin Sommer¹, Vern Paxson²•Institutions (2)

Lawrence Berkeley National Laboratory¹, University of California, Berkeley²

16 May 2010

TL;DR: The main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively.

Abstract: In network intrusion detection research, one popular strategy for finding attacks is monitoring a network's activity for anomalies: deviations from profiles of normality previously learned from benign traffic, typically identified using tools borrowed from the machine learning community. However, despite extensive academic research one finds a striking gap in terms of actual deployments of such systems: compared with other intrusion detection approaches, machine learning is rarely employed in operational "real world" settings. We examine the differences between the network intrusion detection problem and other areas where machine learning regularly finds much more success. Our main claim is that the task of finding attacks is fundamentally different from these other applications, making it significantly harder for the intrusion detection community to employ machine learning effectively. We support this claim by identifying challenges particular to network intrusion detection, and provide a set of guidelines meant to strengthen future research on anomaly detection.

1,377 citations

Journal Article

Text Detection and Recognition in Imagery: A Survey

Qixiang Ye, David Doermann¹•Institutions (1)

University of Maryland, College Park¹

01 Jul 2015-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: This review provides a fundamental comparison and analysis of the remaining problems in the field and summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems.

Abstract: This paper analyzes, compares, and contrasts technical challenges, methods, and the performance of text detection and recognition research in color imagery. It summarizes the fundamental problems and enumerates factors that should be considered when addressing these problems. Existing techniques are categorized as either stepwise or integrated and sub-problems are highlighted including text localization, verification, segmentation and recognition. Special issues associated with the enhancement of degraded text and the processing of video text, multi-oriented, perspectively distorted and multilingual text are also addressed. The categories and sub-categories of text are illustrated, benchmark datasets are enumerated, and the performance of the most representative approaches is compared. This review provides a fundamental comparison and analysis of the remaining problems in the field.

709 citations

Journal Article

MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports

Alistair E. W. Johnson¹, Tom J. Pollard¹, Seth A. Berkowitz², Nathaniel R. Greenbaum², Matthew P. Lungren³, Chih-Ying Deng⁴, Roger G. Mark¹, Steven Horng²•Institutions (4)

Massachusetts Institute of Technology¹, Beth Israel Deaconess Medical Center², Stanford University³, Harvard University⁴

12 Dec 2019-Scientific Data

TL;DR: A large dataset of 227,835 imaging studies for 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011 and 2016 is described and made freely available to facilitate and encourage a wide range of research in computer vision, natural language processing, and clinical data mining.

Abstract: Chest radiography is an extremely powerful imaging modality, allowing for a detailed inspection of a patient's chest, but requires specialized training for proper interpretation. With the advent of high performance general purpose computer vision algorithms, the accurate automated analysis of chest radiographs is becoming increasingly of interest to researchers. Here we describe MIMIC-CXR, a large dataset of 227,835 imaging studies for 65,379 patients presenting to the Beth Israel Deaconess Medical Center Emergency Department between 2011-2016. Each imaging study can contain one or more images, usually a frontal view and a lateral view. A total of 377,110 images are available in the dataset. Studies are made available with a semi-structured free-text radiology report that describes the radiological findings of the images, written by a practicing radiologist contemporaneously during routine clinical care. All images and reports have been de-identified to protect patient privacy. The dataset is made freely available to facilitate and encourage a wide range of research in computer vision, natural language processing, and clinical data mining.

504 citations
