Author

James Allan

Bio: James Allan is an academic researcher from the University of Massachusetts Amherst. The author has contributed to research in topics including Relevance (information retrieval) and Query expansion. The author has an h-index of 64 and has co-authored 303 publications receiving 16,518 citations. Previous affiliations of James Allan include Cornell University and the Information Sciences Institute.


Papers
DOI
01 Jan 1998
TL;DR: Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories.
Abstract: Topic Detection and Tracking (TDT) is a DARPA-sponsored initiative to investigate the state of the art in finding and following new events in a stream of broadcast news stories. The TDT problem consists of three major tasks: (1) segmenting a stream of data, especially recognized speech, into distinct stories; (2) identifying those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream.

1,097 citations
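
For illustration, here is a minimal sketch of a common first-story-detection baseline for task (2): flag a story as new when its maximum cosine similarity to all previously seen stories falls below a threshold. The TF-IDF weighting, the batch (rather than streaming) vectorization, and the threshold value are assumptions for the sketch, not the paper's exact configuration.

```python
# Hypothetical first-story-detection baseline: a story starts a new
# event if it is sufficiently dissimilar from everything seen so far.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def first_story_flags(stories, threshold=0.2):
    """Return one boolean per story: True if it appears to start a new event."""
    # Fitting TF-IDF on the whole stream at once is a simplification;
    # a real TDT system would vectorize incrementally.
    vectors = TfidfVectorizer(stop_words="english").fit_transform(stories)
    flags = [True]  # the first story in the stream is always new
    for i in range(1, vectors.shape[0]):
        sims = cosine_similarity(vectors[i], vectors[:i])
        flags.append(sims.max() < threshold)
    return flags
```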

Book
01 Jan 2002
TL;DR: This collection of technical papers from leading researchers in the field not only provides several chapters devoted to the research program and its evaluation paradigm, but also presents the most current research results and describes some of the remaining open challenges.
Abstract: Topic Detection and Tracking: Event-based Information Organization brings together in one place state-of-the-art research in Topic Detection and Tracking (TDT). This collection of technical papers from leading researchers in the field not only provides several chapters devoted to the research program and its evaluation paradigm, but also presents the most current research results and describes some of the remaining open challenges. Topic Detection and Tracking: Event-based Information Organization is an excellent reference for researchers and practitioners in a variety of fields related to TDT, including information retrieval, automatic speech recognition, machine learning, and information extraction.

872 citations

Proceedings ArticleDOI
06 Nov 2007
TL;DR: It is discovered that there is little practical difference between the randomization, bootstrap, and t tests, whereas the Wilcoxon and sign tests detect significance poorly and their use should be discontinued for measuring the significance of a difference between means.
Abstract: Information retrieval (IR) researchers commonly use three tests of statistical significance: the Student's paired t-test, the Wilcoxon signed rank test, and the sign test. Other researchers have previously proposed using both the bootstrap and Fisher's randomization (permutation) test as non-parametric significance tests for IR but these tests have seen little use. For each of these five tests, we took the ad-hoc retrieval runs submitted to TRECs 3 and 5-8, and for each pair of runs, we measured the statistical significance of the difference in their mean average precision. We discovered that there is little practical difference between the randomization, bootstrap, and t tests. Both the Wilcoxon and sign test have a poor ability to detect significance and have the potential to lead to false detections of significance. The Wilcoxon and sign tests are simplified variants of the randomization test and their use should be discontinued for measuring the significance of a difference between means.

728 citations
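
As a reference point, here is a minimal sketch of Fisher's randomization (permutation) test applied to paired per-topic scores, such as the average precision of two runs over the same topics. The function name, default sample count, and fixed seed are illustrative assumptions, not the paper's exact setup.

```python
# Randomization test on paired per-topic IR scores: flipping the sign
# of a per-topic difference corresponds to swapping the two systems'
# labels on that topic under the null hypothesis of no difference.
import numpy as np

def randomization_test(scores_a, scores_b, samples=10_000, rng=None):
    """Two-sided p-value for the difference in means of paired scores."""
    if rng is None:
        rng = np.random.default_rng(0)
    diffs = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(samples, diffs.size))
    permuted = np.abs((signs * diffs).mean(axis=1))
    # Fraction of random relabelings at least as extreme as observed.
    return (permuted >= observed).mean()
```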

Journal ArticleDOI
01 Aug 1998

704 citations

Proceedings Article
01 Jan 1994
TL;DR: The Smart project continues its work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments, with a major focus on massive query expansion: adding from 300 to 530 terms to each query.
Abstract: The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 3, performing runs in the routing, ad-hoc, and foreign language environments. Our major focus is massive query expansion: adding from 300 to 530 terms to each query. These terms come from known relevant documents in the case of routing, and from just the top retrieved documents in the case of ad-hoc and Spanish. This approach improves effectiveness by between 7% and 25% in the various experiments. Other ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. Using an overlapping text window definition of local, we achieve a 16% improvement.

579 citations
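
For illustration, here is a hedged sketch of query expansion via pseudo-relevance feedback in the spirit the abstract describes: take the top retrieved documents, rank their terms by summed TF-IDF weight, and append the best few hundred to the query. This Rocchio-style term selection is a common textbook approach, not necessarily Smart's exact weighting scheme; the function name and defaults are assumptions.

```python
# Pseudo-relevance feedback: expand a query with high-weight terms
# drawn from the top retrieved documents.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def expand_query(query_terms, top_docs, num_terms=300):
    """Append up to num_terms expansion terms mined from top_docs."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Sum each term's TF-IDF weight across the feedback documents.
    weights = np.asarray(vectorizer.fit_transform(top_docs).sum(axis=0)).ravel()
    vocab = np.array(vectorizer.get_feature_names_out())
    ranked = vocab[np.argsort(weights)[::-1]]
    new_terms = [t for t in ranked if t not in query_terms][:num_terms]
    return list(query_terms) + new_terms
```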


Cited by
Journal ArticleDOI
TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Abstract: Machine Learning is the study of methods for programming computers to learn. Computers are applied to a wide range of tasks, and for most of these it is relatively easy for programmers to design and implement the necessary software. However, there are many tasks for which this is difficult or impossible. These can be divided into four general categories.

First, there are problems for which there exist no human experts. For example, in modern automated manufacturing facilities, there is a need to predict machine failures before they occur by analyzing sensor readings. Because the machines are new, there are no human experts who can be interviewed by a programmer to provide the knowledge necessary to build a computer system. A machine learning system can study recorded data and subsequent machine failures and learn prediction rules.

Second, there are problems where human experts exist, but where they are unable to explain their expertise. This is the case in many perceptual tasks, such as speech recognition, handwriting recognition, and natural language understanding. Virtually all humans exhibit expert-level abilities on these tasks, but none of them can describe the detailed steps that they follow as they perform them. Fortunately, humans can provide machines with examples of the inputs and correct outputs for these tasks, so machine learning algorithms can learn to map the inputs to the outputs.

Third, there are problems where phenomena are changing rapidly. In finance, for example, people would like to predict the future behavior of the stock market, of consumer purchases, or of exchange rates. These behaviors change frequently, so that even if a programmer could construct a good predictive computer program, it would need to be rewritten frequently. A learning program can relieve the programmer of this burden by constantly modifying and tuning a set of learned prediction rules.

Fourth, there are applications that need to be customized for each computer user separately. Consider, for example, a program to filter unwanted electronic mail messages. Different users will need different filters. It is unreasonable to expect each user to program his or her own rules, and it is infeasible to provide every user with a software engineer to keep the rules up-to-date. A machine learning system can learn which mail messages the user rejects and maintain the filtering rules automatically.

Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis. Statistics focuses on understanding the phenomena that have generated the data, often with the goal of testing different hypotheses about those phenomena. Data mining seeks to find patterns in the data that are understandable by people. Psychological studies of human learning aspire to understand the mechanisms underlying the various learning behaviors exhibited by people (concept learning, skill acquisition, strategy change, etc.).

13,246 citations
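
The mail-filtering example in the abstract can be made concrete with a minimal sketch: learn which messages a user rejects and predict for new mail. The Naive Bayes model choice and the toy data are illustrative assumptions, not the article's prescription.

```python
# Toy learned mail filter: fit on messages labeled by the user,
# then predict whether new mail would be rejected.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = ["cheap meds online now", "meeting moved to 3pm",
            "win a free prize today", "draft of the paper attached"]
rejected = [1, 0, 1, 0]  # 1 = user rejected the message

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(messages), rejected)
print(model.predict(vectorizer.transform(["free prize inside"])))  # -> [1]
```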

Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

9,627 citations
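
As one example of the "basic technique" framing the survey describes, here is a minimal sketch from the statistical category: flag points whose z-score exceeds a threshold. The function name, threshold default, and toy data are illustrative assumptions.

```python
# Simple statistical anomaly detection: points far from the mean,
# measured in standard deviations, are flagged as anomalous.
import numpy as np

def zscore_anomalies(values, threshold=3.0):
    """Return indices of points more than `threshold` std devs from the mean."""
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return np.flatnonzero(z > threshold)

data = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.1, 0.9, 1.05, 0.95, 8.0]
print(zscore_anomalies(data))  # -> [10], the single outlier
```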

01 Jan 2002

9,314 citations

Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

9,295 citations
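
Collocation finding, one application the book covers, can be sketched with pointwise mutual information (PMI) over adjacent word pairs. The PMI scoring choice, the function name, and the toy corpus are assumptions for illustration, not the book's only method.

```python
# Rank adjacent word pairs by PMI = log p(x, y) / (p(x) p(y));
# pairs that co-occur more than chance predicts score highest.
import math
from collections import Counter

def pmi_collocations(tokens, min_count=2):
    """Return (bigram, PMI) pairs sorted from strongest collocation down."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {
        (x, y): math.log((c / (n - 1)) / ((unigrams[x] / n) * (unigrams[y] / n)))
        for (x, y), c in bigrams.items() if c >= min_count
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "new york is big but new york never sleeps".split()
print(pmi_collocations(tokens)[:1])  # ('new', 'york') is the only repeated pair
```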