Author

John M. Trenkle

Bio: John M. Trenkle is an academic researcher from the Environmental Research Institute of Michigan. The author has contributed to research in the topics of Feature (machine learning) and Feature extraction. The author has an h-index of 8 and has co-authored 14 publications receiving 2,012 citations.

Papers
31 Dec 1994
TL;DR: An N-gram-based approach to text categorization that is tolerant of textual errors is described, which worked very well for language classification and worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject.
Abstract: Text categorization is a fundamental task in document processing, allowing the automated handling of enormous streams of documents in electronic form. One difficulty in handling some classes of documents is the presence of different kinds of textual errors, such as spelling and grammatical errors in email, and character recognition errors in documents that come through OCR. Text categorization must work reliably on all input, and thus must tolerate some level of these kinds of problems. We describe here an N-gram-based approach to text categorization that is tolerant of textual errors. The system is small, fast and robust. This system worked very well for language classification, achieving in one test a 99.8% correct classification rate on Usenet newsgroup articles written in different languages. The system also worked reasonably well for classifying articles from a number of different computer-oriented newsgroups according to subject, achieving as high as an 80% correct classification rate. There are also several obvious directions for improving the system's classification performance in those cases where it did not do as well. The system is based on calculating and comparing profiles of N-gram frequencies. First, we use the system to compute profiles on training set data that represent the various categories, e.g., language samples or newsgroup content samples. Then the system computes a profile for a particular document that is to be classified. Finally, the system computes a distance measure between the document's profile and each of the category profiles. The system selects the category whose profile has the smallest distance to the document's profile. The profiles involved are quite small, typically 10K bytes for a category training set, and less than 4K bytes for an individual document. Using N-gram frequency profiles provides a simple and reliable way to categorize documents in a wide range of classification tasks.

1,826 citations
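
The profile-and-distance procedure in the abstract above lends itself to a compact sketch. The Python snippet below is an illustrative reconstruction, not the authors' code: the choice of 1- to 5-character N-grams, the 300-entry profile cutoff, and the fixed penalty for N-grams missing from a category profile are assumptions made for the example.

```python
# Sketch of N-gram frequency profiling and a rank-order ("out-of-place") distance
# used to pick the closest category profile. Profile size and N-gram lengths are
# illustrative choices, not values taken from the abstract.
from collections import Counter

def ngram_profile(text, max_len=5, top_k=300):
    """Return the top_k character N-grams (lengths 1..max_len) ranked by frequency."""
    padded = "_" + "_".join(text.lower().split()) + "_"
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(top_k)]

def out_of_place_distance(doc_profile, cat_profile):
    """Sum of rank displacements; N-grams absent from the category get a maximum penalty."""
    cat_rank = {gram: rank for rank, gram in enumerate(cat_profile)}
    max_penalty = len(cat_profile)
    distance = 0
    for rank, gram in enumerate(doc_profile):
        if gram in cat_rank:
            distance += abs(rank - cat_rank[gram])
        else:
            distance += max_penalty
    return distance

def classify(document, category_profiles):
    """Select the category whose profile is nearest to the document's profile."""
    doc_profile = ngram_profile(document)
    return min(category_profiles,
               key=lambda cat: out_of_place_distance(doc_profile, category_profiles[cat]))
```

Here category_profiles would map each category name to a profile built with ngram_profile over the concatenated training text for that category, mirroring the training-then-classification flow the abstract describes.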

Journal ArticleDOI
TL;DR: The use of microarrays of tumor-derived proteins to profile the antibody repertoire in the sera of prostate cancer patients and controls suggests that microarrays of fractionated proteins could be a powerful tool for tumor antigen discovery and cancer diagnosis.
Abstract: The broad characterization of the immune responses elicited by tumors has valuable applications in diagnostics and basic research. We present here the use of microarrays of tumor-derived proteins to profile the antibody repertoire in the sera of prostate cancer patients and controls. Two-dimensional liquid chromatography was used to separate proteins from the prostate cancer cell line LNCaP into 1760 fractions. These fractions were spotted in microarrays on coated microscope slides, and the microarrays were incubated individually with serum samples from 25 men with prostate cancer and 25 male controls. The amount of immunoglobulin bound to each fraction by each serum sample was quantified. Statistical analysis revealed that 38 of the fractions had significantly higher levels of immunoglobulin binding in the prostate cancer samples compared to the controls. Two fractions showed higher binding in the control samples. The significantly higher immunoglobulin reactivity from the prostate cancer samples may reflect a strong immune response to the tumors in the prostate cancer patients. We used multivariate analysis to classify the samples as either prostate cancer or control. In a cross-validation study, recursive partitioning classified the samples with 84% accuracy. A decision tree with two levels of partitioning classified the samples with 98% accuracy. Additional studies will allow further characterization of tumor antigens in prostate cancer and their significance for diagnosis. These results suggest that microarrays of fractionated proteins could be a powerful tool for tumor antigen discovery and cancer diagnosis.

92 citations
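
The cross-validated recursive-partitioning classification described above can be sketched as follows. The study's actual statistical software and data layout are not given in the abstract, so scikit-learn, the synthetic reactivity matrix, and the 5-fold split here are stand-ins chosen only to illustrate the approach.

```python
# Sketch: classify serum samples as cancer vs. control from per-fraction
# immunoglobulin reactivity with a shallow decision tree (recursive partitioning),
# evaluated by cross-validation. Data and tooling are illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1760))      # 50 sera x 1760 protein fractions (synthetic)
y = np.array([1] * 25 + [0] * 25)    # 25 prostate cancer patients, 25 controls

# Two levels of partitioning, echoing the two-level decision tree in the abstract.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
scores = cross_val_score(tree, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```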

PatentDOI
11 Jul 1997
TL;DR: In this paper, a method for the acquisition, mosaicking, cueing and interactive review of large-scale transmission electron micrograph composite images is described, in which individual frames are automatically registered and mosaicked together into a single virtual image composite that is then used to perform automatic cueing of axons and axon clusters.
Abstract: A method is described for acquisition, mosaicking, cueing and interactive review of large-scale transmission electron micrograph composite images. Individual frames are automatically registered and mosaicked together into a single virtual image composite, which is then used to perform automatic cueing of axons and axon clusters, as well as review and marking by qualified neuroanatomists. Statistics derived from the review process were used to evaluate the efficacy of a drug in promoting regeneration of myelinated nerve fibers.

66 citations
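
The abstract above does not specify how frames are registered before mosaicking, so the following is only a minimal sketch of one common approach, phase correlation, which estimates the translation between two overlapping frames; it is an illustrative stand-in, not the patented method.

```python
# Sketch: estimate the (row, col) shift between two overlapping frames with
# phase correlation, a standard translation-registration technique. This is an
# illustrative stand-in; the patent's actual registration algorithm is not given above.
import numpy as np

def phase_correlation_shift(frame_a, frame_b):
    """Return the integer (dy, dx) translation that best aligns frame_b to frame_a."""
    fa = np.fft.fft2(frame_a)
    fb = np.fft.fft2(frame_b)
    cross_power = fa * np.conj(fb)
    cross_power /= np.abs(cross_power) + 1e-12        # keep phase, discard magnitude
    correlation = np.fft.ifft2(cross_power).real
    dy, dx = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Shifts beyond half the frame size wrap around to negative offsets.
    if dy > frame_a.shape[0] // 2:
        dy -= frame_a.shape[0]
    if dx > frame_a.shape[1] // 2:
        dx -= frame_a.shape[1]
    return dy, dx
```

Applied pairwise to neighboring frames, shifts like these are what allow the frames to be placed into a single virtual composite image.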

01 Jan 2007
TL;DR: A system for the recognition of Arabic text in document images is described that is designed to perform well on low-resolution and low-quality document images.
Abstract: Andrew Gillies, Erik Erlandson, John Trenkle, Steve Schlosser. Nonlinear Dynamics Incorporated, 123 N. Ashley Street, Suite 120, Ann Arbor, MI 48104. This paper describes a system for the recognition of Arabic text in document images. The system is designed to perform well on low-resolution and low-quality document images. On a set of 138 page images digitized at 200x200 dpi the system achieved a 93% correct character recognition rate. On the same pages digitized at 100x200 dpi, the system achieved an 89% character recognition rate. The system processes a typical page with simple layout and 45 lines of text in 90 seconds on a 400 MHz Pentium II running Linux.

31 citations
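
Character recognition rates like the 93% and 89% figures quoted above are conventionally computed from the edit distance between recognized and reference text. The sketch below shows that standard formulation; the paper does not state its exact scoring procedure, so this is not necessarily the one used.

```python
# Sketch: character recognition rate as 1 - (edit distance / reference length),
# the usual way OCR accuracy figures are computed. Standard formulation only;
# the paper's own scoring procedure is not given in the abstract.
def edit_distance(ref, hyp):
    """Levenshtein distance between reference and hypothesis strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def char_recognition_rate(reference, recognized):
    return 1.0 - edit_distance(reference, recognized) / max(len(reference), 1)

print(char_recognition_rate("example text", "exarnple text"))  # ~0.83 for two character errors
```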

Proceedings ArticleDOI
07 Mar 1996
TL;DR: A word-level recognition system for machine-printed Arabic text has been implemented and has obtained promising word recognition rates on low-quality multifont text imagery.
Abstract: Many text recognition systems recognize text imagery at the character level and assemble words from the recognized characters. An alternative approach is to recognize text imagery at the word level, without analyzing individual characters. This approach avoids the problem of individual character segmentation, and can overcome local errors in character recognition. A word-level recognition system for machine-printed Arabic text has been implemented. Arabic is a script language, and is therefore difficult to segment at the character level. Character segmentation has been avoided by recognizing text imagery of complete words. The Arabic recognition system computes a vector of image-morphological features on a query word image. This vector is matched against a precomputed database of vectors from a lexicon of Arabic words. Vectors from the database with the highest match score are returned as hypotheses for the unknown image. Several feature vectors may be stored for each word in the database. Database feature vectors generated using multiple fonts and noise models allow the system to be tuned to its input stream. Used in conjunction with database pruning techniques, this Arabic recognition system has obtained promising word recognition rates on low-quality multifont text imagery.

20 citations
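
The word-level matching step described above, comparing a query word image's feature vector against a precomputed database of lexicon vectors and returning the best-scoring entries as hypotheses, can be sketched as follows. The system's actual image-morphological features and match score are not specified in the abstract, so cosine similarity over generic feature vectors stands in for them here.

```python
# Sketch: match a query word-image feature vector against a precomputed lexicon
# database and return the top-scoring word hypotheses. Cosine similarity is an
# illustrative stand-in for the system's actual match score.
import numpy as np

def top_hypotheses(query_vector, database, top_n=5):
    """database maps word -> list of feature vectors (e.g., one per font or noise model)."""
    q = np.asarray(query_vector, dtype=float)
    q /= np.linalg.norm(q) + 1e-12
    scores = []
    for word, vectors in database.items():
        best = max(float(np.dot(q, v / (np.linalg.norm(v) + 1e-12))) for v in vectors)
        scores.append((best, word))
    scores.sort(reverse=True)
    return [(word, score) for score, word in scores[:top_n]]
```

Storing several vectors per word is what lets multiple fonts and noise models be represented in the database, as the abstract notes.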


Cited by
Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research.
* Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects
* Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods
* Includes the downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks, in an updated, interactive interface. Algorithms in the toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization

20,196 citations

Journal ArticleDOI
TL;DR: This survey discusses the main approaches to text categorization that fall within the machine learning paradigm and discusses in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.
Abstract: The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last 10 years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert labor power, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely, document representation, classifier construction, and classifier evaluation.

7,539 citations
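
The three problems the survey highlights, document representation, classifier construction, and classifier evaluation, map onto a standard supervised pipeline. The sketch below is a minimal illustration using scikit-learn; the TF-IDF bag-of-words representation, the linear classifier, and the toy corpus are choices made for the example, not prescriptions from the survey.

```python
# Sketch of the machine-learning approach to text categorization the survey covers:
# represent documents, build a classifier from preclassified documents, evaluate it.
# All concrete choices here are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

docs = ["low interest rates lift markets", "team wins the championship final",
        "central bank raises rates again", "striker scores twice in derby"]
labels = ["finance", "sport", "finance", "sport"]

pipeline = make_pipeline(
    TfidfVectorizer(),                    # document representation
    LogisticRegression(max_iter=1000),    # classifier construction (inductive learning)
)
scores = cross_val_score(pipeline, docs, labels, cv=2)   # classifier evaluation
print(f"cross-validated accuracy: {scores.mean():.2f}")
```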

01 Aug 2000
TL;DR: A Bioentrepreneur course on the assessment of medical technology in the context of commercialization, addressing many issues unique to biomedical products.
Abstract: BIOE 402. Medical Technology Assessment. 2 or 3 hours. Bioentrepreneur course. Assessment of medical technology in the context of commercialization. Objectives, competition, market share, funding, pricing, manufacturing, growth, and intellectual property; many issues unique to biomedical products. Course Information: 2 undergraduate hours. 3 graduate hours. Prerequisite(s): Junior standing or above and consent of the instructor.

4,833 citations

01 May 2005

2,648 citations

Journal ArticleDOI
TL;DR: The tm package is presented which provides a framework for text mining applications within R and techniques for count-based analysis methods, text clustering, text classification and string kernels are presented.
Abstract: During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

1,057 citations
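
The tm paper describes an R framework, so no R code is reproduced here; the sketch below only illustrates the same count-based workflow (corpus to document-term matrix to clustering) in Python, to match the other examples on this page. It is not the tm package's API, and the corpus and cluster count are made up for illustration.

```python
# Sketch of a count-based text-mining workflow analogous to what the tm paper
# describes: build a document-term matrix of raw counts, then cluster documents.
# Written in Python for consistency with the other sketches; not the tm API.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

corpus = ["text mining counts terms in documents",
          "clustering groups similar documents together",
          "term counts feed classification and clustering"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)          # document-term matrix of raw counts
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(dtm)
print(dict(zip(corpus, clusters)))
```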