Home
/
Topics
/
Document processing

Topic

Document processing

About: Document processing is a research topic. Over the lifetime, 4174 publications have been published within this topic receiving 65885 citations.

...read moreread less

Papers published on a yearly basis

2023
2022
2021
2020
2019
2018
2017
2016
2015
2014
2013
2012
2011
2010
2009
2008
2007
2006
2005
2004
2003
2002
2001
2000
1999
1998
1997
1996
1995
1994
1993
1992
1991
1990
1989
1988
1987
1986
1985
1984
1983
1982
1981
1980
1979
1978
1977
1976
1974
1973
1972
1971
1970
1969
1968

1 / 2

Papers

PDF

Open Access

More filters

Book•

The SMART Retrieval System—Experiments in Automatic Document Processing

[...]

Gerard Salton

01 Jan 1971

2,877 citations

N-gram-based text categorization

[...]

W.B. Cavnar, John M. Trenkle

31 Dec 1994

TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.

...read moreread less

Abstract: Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems. We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system`s classification performance in those cases where it did not do as well. The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the variousmore » categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document`s profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document`s profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document. Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.« less

...read moreread less

1,826 citations

Journal Article•DOI•

The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]

[...]

Li Deng¹•Institutions (1)

Microsoft¹

18 Oct 2012-IEEE Signal Processing Magazine

TL;DR: “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.

...read moreread less

Abstract: In this issue, “Best of the Web” presents the modified National Institute of Standards and Technology (MNIST) resources, consisting of a collection of handwritten digit images used extensively in optical character recognition and machine learning research.

...read moreread less

1,626 citations

Patent•

Manual entry interactive paper and electronic document handling and processing system

[...]

Gregory J. Wolff¹, David G. Stork¹•Institutions (1)

Ricoh¹

01 Nov 1995

TL;DR: A pen-like instrument with a writing point for making written entries upon a physical document and sensing the three-dimensional forces exerted on the writing tip as well as the motion associated with the act of writing is described in this article.

...read moreread less

Abstract: A manual entry interactive paper and electronic document handling and process system uses a pen-like instrument (PI) with a writing point for making written entries upon a physical document and sensing the three-dimensional forces exerted on the writing tip as well as the motion associated with the act of writing. The PI is also equipped with a CCD array for reading pre-printed bar codes used for identifying document pages and other application defined areas on the page, as well as for providing optical character recognition data. A communication link between the PI and an associated base unit transfers the transducer data from the PI. The base unit includes a programmable processor, a display, and a communication link receiver. The processor includes programs for written character and word recognition, memory for storage of an electronic version of the physical document and any hand-written additions to the document. The display unit displays the corresponding electronic version of the physical document on a CRT or LCD as a means of feedback to the user and for use by authorized electronic agents.

...read moreread less

1,024 citations

Journal Article•DOI•

The document spectrum for page layout analysis

[...]

Lawrence O'Gorman¹•Institutions (1)

Bell Labs¹

01 Nov 1993-IEEE Transactions on Pattern Analysis and Machine Intelligence

TL;DR: The document spectrum (or docstrum) as discussed by the authors is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components, which yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks.

...read moreread less

Abstract: Page layout analysis is a document processing technique used to determine the format of a page. This paper describes the document spectrum (or docstrum), which is a method for structural page layout analysis based on bottom-up, nearest-neighbor clustering of page components. The method yields an accurate measure of skew, within-line, and between-line spacings and locates text lines and text blocks. It is advantageous over many other methods in three main ways: independence from skew angle, independence from different text spacings, and the ability to process local regions of different text orientations within the same image. Results of the method shown for several different page formats and for randomly oriented subpages on the same image illustrate the versatility of the method. We also discuss the differences, advantages, and disadvantages of the docstrum with respect to other lay-out methods. >

...read moreread less

654 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse

Network Information

Performance

Metrics

4,228

Papers

68,852

Citations

No. of papers in the topic in previous years
Year	Papers
2023	14
2022	40
2021	31
2020	67
2019	67
2018	58

Document processing

Papers published on a yearly basis

Papers

Trending Questions (10)

Network Information

Related Topics (5)

Performance

Metrics