Author

David Newman

Other affiliations: NICTA, Google
Bio: David Newman is an academic researcher from University of California, Irvine. The author has contributed to research on topics including topic modeling and latent Dirichlet allocation. The author has an h-index of 26 and has co-authored 43 publications receiving 6,118 citations. Previous affiliations of David Newman include NICTA and Google.

Papers
Journal ArticleDOI
TL;DR: The model underestimates transport and deposition of East Asian and Australian dust to some regions of the Pacific Ocean; an underestimate of long-range transport of particles larger than 3 μm contributes to this bias.
Abstract: 17 ± 2 Tg; and optical depth at 0.63 μm, 0.030 ± 0.004. This emission, burden, and optical depth are significantly lower than some recent estimates. The model underestimates transport and deposition of East Asian and Australian dust to some regions of the Pacific Ocean. An underestimate of long-range transport of particles larger than 3 μm contributes to this bias. Our experiments support the hypothesis that dust emission "hot spots" exist in regions where alluvial sediments have accumulated and may be disturbed. INDEX TERMS: 0305 Atmospheric Composition and Structure: Aerosols and particles (0345, 4801); 0322 Atmospheric Composition and Structure: Constituent sources and sinks; 4801 Oceanography: Biological and Chemical: Aerosols (0305); 5415 Planetology: Solid Surface Planets: Erosion and weathering; KEYWORDS: mineral dust aerosol, aerosol climatology, mineral deposition, aerosol scavenging, saltation sandblasting, ecosystem fertilization

1,054 citations

Proceedings Article
02 Jun 2010
TL;DR: A simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and other Wikipedia-based lexical relatedness methods also achieve strong results.
Abstract: This paper introduces the novel task of topic coherence evaluation, whereby a set of words, as generated by a topic model, is rated for coherence or interpretability. We apply a range of topic scoring models to the evaluation task, drawing on WordNet, Wikipedia and the Google search engine, and existing research on lexical similarity/relatedness. In comparison with human scores for a set of learned topics over two distinct datasets, we show a simple co-occurrence measure based on pointwise mutual information over Wikipedia data is able to achieve results for the task at or nearing the level of inter-annotator correlation, and that other Wikipedia-based lexical relatedness methods also achieve strong results. Google produces strong, if less consistent, results, while our results over WordNet are patchy at best.
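To make the co-occurrence measure concrete, here is a minimal sketch of PMI-based topic coherence, assuming document-level co-occurrence counts rather than the sliding-window Wikipedia counts used in the paper; the function name and smoothing constant are illustrative, not the authors' released code.

import math
from itertools import combinations

def pmi_topic_coherence(topic_words, doc_freq, co_doc_freq, num_docs, eps=1.0):
    """Average pairwise PMI of a topic's top words, estimated from
    document co-occurrence counts (a stand-in for the sliding-window
    Wikipedia counts used in the paper). `eps` is add-eps smoothing."""
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1 = doc_freq[w1] / num_docs
        p2 = doc_freq[w2] / num_docs
        joint = co_doc_freq.get(tuple(sorted((w1, w2))), 0) + eps
        p12 = joint / num_docs
        scores.append(math.log(p12 / (p1 * p2)))
    return sum(scores) / len(scores)

Topics whose top words frequently co-occur in the reference corpus score higher, which is the property the paper finds tracks human coherence judgements.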

832 citations

Proceedings ArticleDOI
24 Aug 2008
TL;DR: A novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model that results in significant speedups on real-world text corpora and can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA.
Abstract: In this paper we introduce a novel collapsed Gibbs sampling method for the widely used latent Dirichlet allocation (LDA) model. Our new method results in significant speedups on real-world text corpora. Conventional Gibbs sampling schemes for LDA require O(K) operations per sample where K is the number of topics in the model. Our proposed method draws equivalent samples but requires on average significantly fewer than K operations per sample. On real-world corpora, FastLDA can be as much as 8 times faster than the standard collapsed Gibbs sampler for LDA. No approximations are necessary, and we show that our fast sampling scheme produces exactly the same results as the standard (but slower) sampling scheme. Experiments on four real-world data sets demonstrate speedups for a wide range of collection sizes. For the PubMed collection of over 8 million documents with a required computation time of 6 CPU months for LDA, our speedup of 5.7 can save 5 CPU months of computation.
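For reference, the standard collapsed Gibbs update that FastLDA accelerates samples each token's topic from its full conditional in O(K) time; a minimal sketch of that baseline update is below (count-array names and hyperparameters are illustrative, and the FastLDA refinements that avoid touching all K topics are not shown).

import numpy as np

def gibbs_resample_token(d, w, old_k, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """One O(K) collapsed Gibbs update for a single token: remove its
    current topic assignment, sample a new topic from the full
    conditional, and add the counts back."""
    # remove the current assignment from the count matrices
    n_dk[d, old_k] -= 1
    n_kw[old_k, w] -= 1
    n_k[old_k] -= 1
    # full conditional: p(z=k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    new_k = rng.choice(len(p), p=p / p.sum())
    # record the new assignment
    n_dk[d, new_k] += 1
    n_kw[new_k, w] += 1
    n_k[new_k] += 1
    return new_k

Roughly speaking, FastLDA's speedup comes from bounding the normalization term so the sampler can usually stop after inspecting only the most probable topics while still drawing from exactly this distribution.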

591 citations

ReportDOI
04 Dec 2006
TL;DR: This paper proposes the collapsed variational Bayesian inference algorithm for LDA, and shows that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
Abstract: Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
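To give the flavor of the collapsed variational update, here is a simplified zero-order sketch; the paper's algorithm adds second-order correction terms from a Gaussian approximation, and the array names and the omission of those terms are simplifications for illustration.

import numpy as np

def cvb0_update(gamma_ij, E_ndk, E_nkw, E_nk, alpha, beta, V):
    """Zero-order collapsed variational update for one token: subtract
    the token's own expected counts (length-K vectors per topic),
    recompute its topic distribution, and renormalize."""
    ndk = E_ndk - gamma_ij   # expected doc-topic counts excluding this token
    nkw = E_nkw - gamma_ij   # expected topic-word counts for this word
    nk = E_nk - gamma_ij     # expected total counts per topic
    new_gamma = (ndk + alpha) * (nkw + beta) / (nk + V * beta)
    return new_gamma / new_gamma.sum()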

561 citations

Proceedings ArticleDOI
01 Apr 2014
TL;DR: This work explores the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provides recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
Abstract: Topic models based on latent Dirichlet allocation and related methods are used in a range of user-focused tasks including document navigation and trend analysis, but evaluation of the intrinsic quality of the topic model and topics remains an open research area. In this work, we explore the two tasks of automatic evaluation of single topics and automatic evaluation of whole topic models, and provide recommendations on the best strategy for performing the two tasks, in addition to providing an open-source toolkit for topic and topic model evaluation.
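A common way to relate the two tasks is to score a whole model by aggregating per-topic scores; the sketch below assumes a simple mean, which is an illustrative choice rather than necessarily the paper's recommended strategy.

def model_coherence(topics, topic_score):
    """Score a whole topic model as the mean of per-topic scores.
    `topics` is a list of top-word lists; `topic_score` is any
    per-topic coherence function (e.g. the PMI measure sketched above)."""
    scores = [topic_score(words) for words in topics]
    return sum(scores) / len(scores)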

493 citations


Cited by
Journal ArticleDOI
TL;DR: A tutorial reviewing probabilistic topic models: the modeling assumptions behind latent Dirichlet allocation and its extensions, algorithms for approximate posterior inference over large document collections, and applications of topic models to text and other data.
Abstract: Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems.In this tutorial, I will review the state-of-the-art in probabilistic topic models. I will describe the three components of topic modeling:(1) Topic modeling assumptions(2) Algorithms for computing with topic models(3) Applications of topic modelsIn (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and models of mixed-membership.In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and documents arriving in a stream.In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations.Finally, I will discuss some future directions and open research problems in topic models.
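Since the tutorial is organized around latent Dirichlet allocation, a minimal sketch of LDA's generative process is useful for fixing ideas; the corpus sizes and hyperparameters below are illustrative.

import numpy as np

def generate_lda_corpus(num_docs, doc_len, K=10, V=1000, alpha=0.1, eta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process: per-topic
    word distributions, per-document topic proportions, then a topic
    and a word for every token."""
    rng = np.random.default_rng(seed)
    topics = rng.dirichlet(np.full(V, eta), size=K)        # beta_k ~ Dirichlet(eta)
    docs = []
    for _ in range(num_docs):
        theta = rng.dirichlet(np.full(K, alpha))           # theta_d ~ Dirichlet(alpha)
        z = rng.choice(K, size=doc_len, p=theta)           # z_n ~ Multinomial(theta_d)
        words = [rng.choice(V, p=topics[k]) for k in z]    # w_n ~ Multinomial(beta_{z_n})
        docs.append(words)
    return topics, docs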

4,529 citations

Journal ArticleDOI
TL;DR: Although combinatorial chemistry techniques have succeeded as methods of optimizing structures and have been used very successfully in the optimization of many recently approved agents, only two de novo combinatorial compounds approved as drugs could be identified in this 39-year time frame.
Abstract: This review is an updated and expanded version of the five prior reviews that were published in this journal in 1997, 2003, 2007, 2012, and 2016. For all approved therapeutic agents, the time frame has been extended to cover the almost 39 years from the first of January 1981 to the 30th of September 2019 for all diseases worldwide and from ∼1946 (earliest so far identified) to the 30th of September 2019 for all approved antitumor drugs worldwide. As in earlier reviews, only the first approval of any drug is counted, irrespective of how many "biosimilars" or added approvals were subsequently identified. As in the 2012 and 2016 reviews, we have continued to utilize our secondary subdivision of a "natural product mimic", or "NM", to join the original primary divisions, and the designation "natural product botanical", or "NB", to cover those botanical "defined mixtures" now recognized as drug entities by the FDA (and similar organizations). From the data presented in this review, the utilization of natural products and/or synthetic variations using their novel structures, in order to discover and develop the final drug entity, is still alive and well. For example, in the area of cancer, over the time frame from 1946 to 1980, of the 75 small molecules, 40, or 53.3%, are N or ND. In the 1981 to date time frame the equivalent figures for the N* compounds of the 185 small molecules are 62, or 33.5%, though to these can be added the 58 S* and S*/NMs, bringing the figure to 64.9%. In other areas, the influence of natural product structures is quite marked with, as expected from prior information, the anti-infective area being dependent on natural products and their structures, though as can be seen in the review there are still disease areas (shown in Table 2) for which there are no drugs derived from natural products. Although combinatorial chemistry techniques have succeeded as methods of optimizing structures and have been used very successfully in the optimization of many recently approved agents, we are still able to identify only two de novo combinatorial compounds (one of which is a little speculative) approved as drugs in this 39-year time frame, though there is also one drug that was developed using the "fragment-binding methodology" and approved in 2012. We have also added a discussion of candidate drug entities currently in clinical trials as "warheads" and some very interesting preliminary reports on sources of novel antibiotics from Nature due to the absolute requirement for new agents to combat plasmid-borne resistance genes now in the general populace. We continue to draw the attention of readers to the recognition that a significant number of natural product drugs/leads are actually produced by microbes and/or microbial interactions with the "host from whence it was isolated"; thus we consider that this area of natural product research should be expanded significantly.

2,560 citations

Journal ArticleDOI
01 Apr 2005-Science
TL;DR: The iron cycle, in which iron-containing soil dust is transported from land through the atmosphere to the oceans, affecting ocean biogeochemistry and hence having feedback effects on climate and dust production, is reviewed.
Abstract: The environmental conditions of Earth, including the climate, are determined by physical, chemical, biological, and human interactions that transform and transport materials and energy. This is the "Earth system": a highly complex entity characterized by multiple nonlinear responses and thresholds, with linkages between disparate components. One important part of this system is the iron cycle, in which iron-containing soil dust is transported from land through the atmosphere to the oceans, affecting ocean biogeochemistry and hence having feedback effects on climate and dust production. Here we review the key components of this cycle, identifying critical uncertainties and priorities for future research.

2,475 citations

Proceedings ArticleDOI
25 Jun 2006
TL;DR: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections, and dynamic topic models provide a qualitative window into the contents of a large document collection.
Abstract: A family of probabilistic time series models is developed to analyze the time evolution of topics in large document collections. The approach is to use state space models on the natural parameters of the multinomial distributions that represent the topics. Variational approximations based on Kalman filters and nonparametric wavelet regression are developed to carry out approximate posterior inference over the latent topics. In addition to giving quantitative, predictive models of a sequential corpus, dynamic topic models provide a qualitative window into the contents of a large document collection. The models are demonstrated by analyzing the OCR'ed archives of the journal Science from 1880 through 2000.
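To make the state-space construction concrete, the sketch below shows the topic-evolution step assumed in a dynamic topic model: each topic's natural parameters take a Gaussian random-walk step between time slices and are mapped to word probabilities with a softmax; the drift variance is illustrative.

import numpy as np

def evolve_topics(beta_prev, sigma=0.05, rng=None):
    """One time step of a dynamic topic model's state-space prior:
    each topic's natural parameters take a Gaussian random-walk step,
    and words are then distributed via the softmax of those parameters."""
    rng = rng or np.random.default_rng(0)
    beta_t = beta_prev + rng.normal(scale=sigma, size=beta_prev.shape)  # beta_t ~ N(beta_{t-1}, sigma^2 I)
    word_dist = np.exp(beta_t - beta_t.max(axis=1, keepdims=True))
    word_dist /= word_dist.sum(axis=1, keepdims=True)                   # softmax over the vocabulary
    return beta_t, word_dist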

2,410 citations

Journal ArticleDOI
TL;DR: Stochastic variational inference lets us apply complex Bayesian models to massive data sets, and it is shown that the Bayesian nonparametric topic model outperforms its parametric counterpart.
Abstract: We develop stochastic variational inference, a scalable algorithm for approximating posterior distributions. We develop this technique for a large class of probabilistic models and we demonstrate it with two probabilistic topic models, latent Dirichlet allocation and the hierarchical Dirichlet process topic model. Using stochastic variational inference, we analyze several large collections of documents: 300K articles from Nature, 1.8M articles from The New York Times, and 3.8M articles from Wikipedia. Stochastic inference can easily handle data sets of this size and outperforms traditional variational inference, which can only handle a smaller subset. (We also show that the Bayesian nonparametric topic model outperforms its parametric counterpart.) Stochastic variational inference lets us apply complex Bayesian models to massive data sets.
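The heart of stochastic variational inference is a noisy natural-gradient step on the global variational parameters, computed from a single sampled document or minibatch and blended in with a decaying step size; below is a minimal sketch with an illustrative Robbins-Monro schedule.

def svi_step(lam, lam_hat, t, tau=1.0, kappa=0.7):
    """One stochastic variational inference update of a global
    variational parameter: blend the current estimate with the
    intermediate estimate lam_hat computed from a sampled document,
    using a decaying step size rho_t = (t + tau) ** (-kappa)."""
    rho = (t + tau) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat

With kappa in (0.5, 1], the step sizes satisfy the Robbins-Monro conditions, which is what allows the noisy updates to converge.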

2,291 citations