Journal ArticleDOI

Finding scientific topics

TL;DR: Describes a generative model for documents, introduced by Blei, Ng, and Jordan, presents a Markov chain Monte Carlo algorithm for inference in this model, and uses the algorithm to analyze abstracts from PNAS, with Bayesian model selection used to establish the number of topics.
Abstract: A first step in identifying the content of a document is determining which topics that document addresses. We describe a generative model for documents, introduced by Blei, Ng, and Jordan [Blei, D. M., Ng, A. Y. & Jordan, M. I. (2003) J. Machine Learn. Res. 3, 993-1022], in which each document is generated by choosing a distribution over topics and then choosing each word in the document from a topic selected according to this distribution. We then present a Markov chain Monte Carlo algorithm for inference in this model. We use this algorithm to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics. We show that the extracted topics capture meaningful structure in the data, consistent with the class designations provided by the authors of the articles, and outline further applications of this analysis, including identifying “hot topics” by examining temporal dynamics and tagging abstracts to illustrate semantic content.
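For intuition, the generative process the abstract describes can be written out directly. A minimal sketch in Python/NumPy; the corpus sizes and Dirichlet hyperparameters are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): T topics, W word types, D documents.
T, W, D, doc_len = 4, 50, 10, 30
alpha, beta = 0.1, 0.01  # symmetric Dirichlet hyperparameters (assumed values)

# Each topic is a distribution over the vocabulary: phi_t ~ Dirichlet(beta).
phi = rng.dirichlet(np.full(W, beta), size=T)      # shape (T, W)

docs = []
for _ in range(D):
    # 1. Choose the document's distribution over topics: theta ~ Dirichlet(alpha).
    theta = rng.dirichlet(np.full(T, alpha))
    # 2. For each word, pick a topic from theta, then a word from that topic.
    z = rng.choice(T, size=doc_len, p=theta)
    w = np.array([rng.choice(W, p=phi[t]) for t in z])
    docs.append(w)
```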


Citations
Proceedings Article
03 Jan 2001
TL;DR: Proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Abstract: We propose a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams [6], and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI) [3]. In the context of text modeling, our model posits that each document is generated as a mixture of topics, where the continuous-valued mixture proportions are distributed as a latent Dirichlet random variable. Inference and learning are carried out efficiently via variational algorithms. We present empirical results on applications of this model to problems in text modeling, collaborative filtering, and text classification.
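The variational inference route mentioned in the abstract is available off the shelf. A minimal sketch using scikit-learn's LatentDirichletAllocation, which fits LDA by variational Bayes; the four-document corpus is a toy stand-in:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "gene expression in the cell",
    "protein folding and gene regulation",
    "stock market price prediction",
    "market volatility and price models",
]

# Bag-of-words counts, then LDA fit by batch variational Bayes.
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, learning_method="batch",
                                random_state=0).fit(X)

# Per-document topic mixtures (rows sum to 1).
theta = lda.transform(X)        # shape (n_docs, n_topics)
print(theta.round(2))
```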

25,546 citations

Journal ArticleDOI
TL;DR: SciPy is an open-source scientific computing library for the Python programming language whose functionality spans clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics.
Abstract: SciPy is an open source scientific computing library for the Python programming language. SciPy 1.0 was released in late 2017, about 16 years after the original version 0.1 release. SciPy has become a de facto standard for leveraging scientific algorithms in the Python programming language, with more than 600 unique code contributors, thousands of dependent packages, over 100,000 dependent repositories, and millions of downloads per year. This includes usage of SciPy in almost half of all machine learning projects on GitHub, and usage by high profile projects including LIGO gravitational wave analysis and creation of the first-ever image of a black hole (M87). The library includes functionality spanning clustering, Fourier transforms, integration, interpolation, file I/O, linear algebra, image processing, orthogonal distance regression, minimization algorithms, signal processing, sparse matrix handling, computational geometry, and statistics. In this work, we provide an overview of the capabilities and development practices of the SciPy library and highlight some recent technical developments.
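Three of the listed capabilities in a few lines each; a minimal sketch using standard SciPy APIs, with toy inputs as assumptions:

```python
import numpy as np
from scipy import integrate, linalg, optimize

# Integration: quadrature of a smooth function on [0, pi].
val, err = integrate.quad(np.sin, 0, np.pi)        # val is approximately 2.0

# Linear algebra: solve the system Ax = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = linalg.solve(A, b)

# Minimization: find the minimum of a one-dimensional function.
res = optimize.minimize_scalar(lambda t: (t - 1.5) ** 2)
print(val, x, res.x)
```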

12,774 citations

Book
24 Aug 2012
TL;DR: This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach, and is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.
Abstract: Today's Web-enabled deluge of electronic data calls for automated methods of data analysis. Machine learning provides these, developing methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data. This textbook offers a comprehensive and self-contained introduction to the field of machine learning, based on a unified, probabilistic approach. The coverage combines breadth and depth, offering necessary background material on such topics as probability, optimization, and linear algebra as well as discussion of recent developments in the field, including conditional random fields, L1 regularization, and deep learning. The book is written in an informal, accessible style, complete with pseudo-code for the most important algorithms. All topics are copiously illustrated with color images and worked examples drawn from such application domains as biology, text processing, computer vision, and robotics. Rather than providing a cookbook of different heuristic methods, the book stresses a principled model-based approach, often using the language of graphical models to specify models in a concise and intuitive way. Almost all the models described have been implemented in a MATLAB software package--PMTK (probabilistic modeling toolkit)--that is freely available online. The book is suitable for upper-level undergraduates with an introductory-level college math background and beginning graduate students.

8,059 citations


Cites methods from "Finding scientific topics"

  • ...This approach was first suggested in (Griffiths and Steyvers 2004), and is an example of collapsed Gibbs sampling....

    [...]
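The collapsed Gibbs sampler referenced in the quote above admits a compact statement: with the topic mixtures and topic-word distributions integrated out, each word's topic assignment is resampled from its full conditional given all other assignments. A minimal sketch in Python/NumPy of the update from Griffiths and Steyvers (2004); the corpus, sizes, and symmetric hyperparameters are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def collapsed_gibbs(docs, W, T, alpha=0.1, beta=0.01, iters=200):
    """docs: list of arrays of word ids in [0, W). Returns topic assignments per word."""
    z = [rng.integers(T, size=len(d)) for d in docs]   # random initial assignments
    ndt = np.zeros((len(docs), T))                     # document-topic counts
    nwt = np.zeros((W, T))                             # word-topic counts
    nt = np.zeros(T)                                   # total words per topic
    for d, doc in enumerate(docs):
        for w, t in zip(doc, z[d]):
            ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]                            # remove word i's current assignment
                ndt[d, t] -= 1; nwt[w, t] -= 1; nt[t] -= 1
                # Full conditional (dropping the factor constant in t):
                # P(z_i = t | z_-i, w) proportional to
                #   (n_wt + beta) / (n_t + W*beta) * (n_dt + alpha)
                p = (nwt[w] + beta) / (nt + W * beta) * (ndt[d] + alpha)
                t = rng.choice(T, p=p / p.sum())
                z[d][i] = t
                ndt[d, t] += 1; nwt[w, t] += 1; nt[t] += 1
    return z

# Toy usage: three documents over a 6-word vocabulary, 2 topics.
docs = [np.array([0, 1, 0, 2]), np.array([3, 4, 5, 4]), np.array([0, 2, 1, 1])]
print(collapsed_gibbs(docs, W=6, T=2, iters=100))
```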


22 May 2010
TL;DR: Describes a natural language processing software framework based on the idea of document streaming, i.e. processing corpora document after document in a memory-independent fashion, and implements several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size.
Abstract: Large corpora are ubiquitous in today's world and memory quickly becomes the limiting factor in practical applications of the Vector Space Model (VSM). We identify a gap in existing VSM implementations: their scalability and ease of use. We describe a Natural Language Processing software framework which is based on the idea of document streaming, i.e. processing corpora document after document, in a memory-independent fashion. In this framework, we implement several popular algorithms for topical inference, including Latent Semantic Analysis and Latent Dirichlet Allocation, in a way that makes them completely independent of the training corpus size. Particular emphasis is placed on straightforward and intuitive framework design, so that modifications and extensions of the methods and/or their application by interested practitioners are effortless. We demonstrate the usefulness of our approach on a real-world scenario of computing document similarities within an existing digital library, DML-CZ.
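A minimal sketch of the document-streaming idea, using the gensim API this framework became; the corpus is a toy stand-in, and in practice the generator would read documents lazily from disk:

```python
from gensim import corpora, models

def stream_tokens():
    # Stand-in corpus; a real pipeline would yield one document at a time from disk.
    docs = ["human computer interaction", "graph of trees",
            "trees and graph minors", "user interface system"]
    for line in docs:
        yield line.split()

dictionary = corpora.Dictionary(stream_tokens())   # one streaming pass builds the vocabulary

class BowStream:
    # Re-iterable corpus: yields one bag-of-words vector at a time,
    # so the whole corpus is never held in memory.
    def __iter__(self):
        for tokens in stream_tokens():
            yield dictionary.doc2bow(tokens)

lda = models.LdaModel(BowStream(), id2word=dictionary, num_topics=2)
print(lda.print_topics())
```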

3,965 citations


Cites methods from "Finding scientific topics"


  • ...This can be accomplished by variational Bayes approximations (Blei et al., 2003) or by Gibbs sampling (Griffiths and Steyvers, 2004)....

    [...]

References
Book
01 Jan 1962
TL;DR: The Structure of Scientific Revolutions, first published in 1962, was a landmark in the history and philosophy of science; Kuhn argued that transformative ideas arise not from the gradual, day-to-day accumulation of data but from revolutions that disrupt accepted thinking outside of "normal science."
Abstract: A good book may have the power to change the way we see the world, but a great book actually becomes part of our daily consciousness, pervading our thinking to the point that we take it for granted, and we forget how provocative and challenging its ideas once were, and still are. "The Structure of Scientific Revolutions" is that kind of book. When it was first published in 1962, it was a landmark event in the history and philosophy of science. And fifty years later, it still has many lessons to teach. With "The Structure of Scientific Revolutions", Kuhn challenged long-standing linear notions of scientific progress, arguing that transformative ideas don't arise from the day-to-day, gradual process of experimentation and data accumulation, but that revolutions in science, those breakthrough moments that disrupt accepted thinking and offer unanticipated ideas, occur outside of "normal science," as he called it. Though Kuhn was writing when physics ruled the sciences, his ideas on how scientific revolutions bring order to the anomalies that amass over time in research experiments are still instructive in our biotech age. This new edition of Kuhn's essential work in the history of science includes an insightful introductory essay by Ian Hacking that clarifies terms popularized by Kuhn, including paradigm and incommensurability, and applies Kuhn's ideas to the science of today. Usefully keyed to the separate sections of the book, Hacking's essay provides important background information as well as a contemporary context. Newly designed, with an expanded index, this edition will be eagerly welcomed by the next generation of readers seeking to understand the history of our perspectives on science.

36,808 citations

Journal ArticleDOI
TL;DR: Develops an analogy between images and statistical mechanics systems in which pixel gray levels and edges behave like states in a lattice; annealing under the posterior distribution then yields the maximum a posteriori (MAP) estimate of the image given the degraded observations, producing a highly parallel "relaxation" algorithm for MAP estimation.
Abstract: We make an analogy between images and statistical mechanics systems. Pixel gray levels and the presence and orientation of edges are viewed as states of atoms or molecules in a lattice-like physical system. The assignment of an energy function in the physical system determines its Gibbs distribution. Because of the Gibbs distribution, Markov random field (MRF) equivalence, this assignment also determines an MRF image model. The energy function is a more convenient and natural mechanism for embodying picture attributes than are the local characteristics of the MRF. For a range of degradation mechanisms, including blurring, nonlinear deformations, and multiplicative or additive noise, the posterior distribution is an MRF with a structure akin to the image model. By the analogy, the posterior distribution defines another (imaginary) physical system. Gradual temperature reduction in the physical system isolates low energy states ("annealing"), or what is the same thing, the most probable states under the Gibbs distribution. The analogous operation under the posterior distribution yields the maximum a posteriori (MAP) estimate of the image given the degraded observations. The result is a highly parallel "relaxation" algorithm for MAP estimation. We establish convergence properties of the algorithm and we experiment with some simple pictures, for which good restorations are obtained at low signal-to-noise ratios.
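A minimal sketch of the paper's core idea on a binary image: an Ising-style prior couples neighbouring pixel labels, Gibbs updates sample each pixel from its local conditional distribution, and a decreasing temperature anneals the chain toward low-energy (MAP-like) states. The image model, coupling strengths, and cooling schedule here are toy assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise(y, J=1.0, h=1.0, sweeps=30, T0=4.0):
    """y: noisy image with entries in {-1, +1}. Gibbs sampling with annealing."""
    x = y.copy()
    H, W = x.shape
    for s in range(sweeps):
        T = T0 * 0.9 ** s                        # gradually lower the temperature
        for i in range(H):
            for j in range(W):
                # Local field: neighbour agreement (prior) plus data-fidelity term.
                nb = sum(x[a, b] for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                         if 0 <= a < H and 0 <= b < W)
                dE = 2 * (J * nb + h * y[i, j])  # energy gap between x_ij = +1 and -1
                p_plus = 1.0 / (1.0 + np.exp(-dE / T))
                x[i, j] = 1 if rng.random() < p_plus else -1
    return x

# Toy usage: a clean block pattern with 10% of pixels flipped.
clean = -np.ones((16, 16), dtype=int); clean[4:12, 4:12] = 1
noisy = clean * np.where(rng.random(clean.shape) < 0.1, -1, 1)
print((denoise(noisy) == clean).mean())          # fraction of pixels restored
```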

18,761 citations

Book
28 May 1999
TL;DR: This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear and provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations.
Abstract: Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.

9,295 citations


"Finding scientific topics" refers background in this paper

  • ...Perplexity is a standard measure of performance for statistical models of natural language (14) and is defined as $\exp\{-\log P(\mathbf{w}_{\text{test}})/n_{\text{test}}\}$, where $\mathbf{w}_{\text{test}}$ and $n_{\text{test}}$ indicate the identities and number of words in the test set, respectively....

    [...]
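In code, the quoted definition is just an exponentiated negative per-word log-likelihood. A minimal sketch, where log_p_test (a hypothetical input) stands for a model's log probability of the held-out words, computed however the model provides it:

```python
import numpy as np

def perplexity(log_p_test: float, n_test: int) -> float:
    """exp(-log P(w_test) / n_test): lower is better."""
    return float(np.exp(-log_p_test / n_test))

# E.g., a model assigning each of 1,000 test words probability 1/500
# has perplexity 500.
print(perplexity(1000 * np.log(1.0 / 500), 1000))   # -> 500.0
```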

BookDOI
TL;DR: A practical guide to Markov chain Monte Carlo, covering the Gibbs sampler and Metropolis-Hastings algorithm, convergence theory and monitoring, model determination and selection, and applications ranging from generalized linear mixed models and disease mapping to image analysis, genetics, and radiocarbon dating.
Abstract: Table of contents:
  • Introducing Markov Chain Monte Carlo (Introduction; The Problem; Markov Chain Monte Carlo; Implementation; Discussion)
  • Hepatitis B: A Case Study in MCMC Methods (Introduction; Hepatitis B Immunization; Modelling; Fitting a Model Using Gibbs Sampling; Model Elaboration; Conclusion)
  • Markov Chain Concepts Related to Sampling Algorithms (Markov Chains; Rates of Convergence; Estimation; The Gibbs Sampler and Metropolis-Hastings Algorithm)
  • Introduction to General State-Space Markov Chain Theory (Introduction; Notation and Definitions; Irreducibility, Recurrence, and Convergence; Harris Recurrence; Mixing Rates and Central Limit Theorems; Regeneration; Discussion)
  • Full Conditional Distributions (Introduction; Deriving Full Conditional Distributions; Sampling from Full Conditional Distributions; Discussion)
  • Strategies for Improving MCMC (Introduction; Reparameterization; Random and Adaptive Direction Sampling; Modifying the Stationary Distribution; Methods Based on Continuous-Time Processes; Discussion)
  • Implementing MCMC (Introduction; Determining the Number of Iterations; Software and Implementation; Output Analysis; Generic Metropolis Algorithms; Discussion)
  • Inference and Monitoring Convergence (Difficulties in Inference from Markov Chain Simulation; The Risk of Undiagnosed Slow Convergence; Multiple Sequences and Overdispersed Starting Points; Monitoring Convergence Using Simulation Output; Output Analysis for Inference; Output Analysis for Improving Efficiency)
  • Model Determination Using Sampling-Based Methods (Introduction; Classical Approaches; The Bayesian Perspective and the Bayes Factor; Alternative Predictive Distributions; How to Use Predictive Distributions; Computational Issues; An Example; Discussion)
  • Hypothesis Testing and Model Selection (Introduction; Uses of Bayes Factors; Marginal Likelihood Estimation by Importance Sampling; Marginal Likelihood Estimation Using Maximum Likelihood; Application: How Many Components in a Mixture?; Discussion; Appendix: S-PLUS Code for the Laplace-Metropolis Estimator)
  • Model Checking and Model Improvement (Introduction; Model Checking Using Posterior Predictive Simulation; Model Improvement via Expansion; Example: Hierarchical Mixture Modelling of Reaction Times)
  • Stochastic Search Variable Selection (Introduction; A Hierarchical Bayesian Model for Variable Selection; Searching the Posterior by Gibbs Sampling; Extensions; Constructing Stock Portfolios With SSVS; Discussion)
  • Bayesian Model Comparison via Jump Diffusions (Introduction; Model Choice; Jump-Diffusion Sampling; Mixture Deconvolution; Object Recognition; Variable Selection; Change-Point Identification; Conclusions)
  • Estimation and Optimization of Functions (Non-Bayesian Applications of MCMC; Monte Carlo Optimization; Monte Carlo Likelihood Analysis; Normalizing-Constant Families; Missing Data; Decision Theory; Which Sampling Distribution?; Importance Sampling; Discussion)
  • Stochastic EM: Method and Application (Introduction; The EM Algorithm; The Stochastic EM Algorithm; Examples)
  • Generalized Linear Mixed Models (Introduction; Generalized Linear Models (GLMs); Bayesian Estimation of GLMs; Gibbs Sampling for GLMs; Generalized Linear Mixed Models (GLMMs); Specification of Random-Effect Distributions; Hyperpriors and the Estimation of Hyperparameters; Some Examples; Discussion)
  • Hierarchical Longitudinal Modelling (Introduction; Clinical Background; Model Detail and MCMC Implementation; Results; Summary and Discussion)
  • Medical Monitoring (Introduction; Modelling Medical Monitoring; Computing Posterior Distributions; Forecasting; Model Criticism; Illustrative Application; Discussion)
  • MCMC for Nonlinear Hierarchical Models (Introduction; Implementing MCMC; Comparison of Strategies; A Case Study from Pharmacokinetics-Pharmacodynamics; Extensions and Discussion)
  • Bayesian Mapping of Disease (Introduction; Hypotheses and Notation; Maximum Likelihood Estimation of Relative Risks; Hierarchical Bayesian Model of Relative Risks; Empirical Bayes Estimation of Relative Risks; Fully Bayesian Estimation of Relative Risks; Discussion)
  • MCMC in Image Analysis (Introduction; The Relevance of MCMC to Image Analysis; Image Models at Different Levels; Methodological Innovations in MCMC Stimulated by Imaging; Discussion)
  • Measurement Error (Introduction; Conditional-Independence Modelling; Illustrative Examples; Discussion)
  • Gibbs Sampling Methods in Genetics (Introduction; Standard Methods in Genetics; Gibbs Sampling Approaches; MCMC Maximum Likelihood; Application to a Family Study of Breast Cancer; Conclusions)
  • Mixtures of Distributions: Inference and Estimation (Introduction; The Missing Data Structure; Gibbs Sampling Implementation; Convergence of the Algorithm; Testing for Mixtures; Infinite Mixtures and Other Extensions)
  • An Archaeological Example: Radiocarbon Dating (Introduction; Background to Radiocarbon Dating; Archaeological Problems and Questions; Illustrative Examples; Discussion)
  • Index
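Of the algorithms the book covers, random-walk Metropolis is the shortest to state. A minimal sketch targeting a standard normal density with a Gaussian proposal; all choices here (target, step size, chain length) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_hastings(log_target, x0=0.0, n=10_000, step=1.0):
    """Random-walk Metropolis: propose x' ~ N(x, step^2), accept with
    probability min(1, pi(x') / pi(x))."""
    xs = np.empty(n)
    x, lp = x0, log_target(x0)
    for i in range(n):
        prop = x + step * rng.standard_normal()
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:   # acceptance test in log space
            x, lp = prop, lp_prop
        xs[i] = x
    return xs

samples = metropolis_hastings(lambda t: -0.5 * t * t)  # log N(0,1), up to a constant
print(samples.mean(), samples.std())                   # roughly 0 and 1
```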

7,399 citations