Mining business topics in source code using latent dirichlet allocation

doi:10.1145/1342211.1342234

Proceedings Article

Girish Maskeri, +2 more

- pp 113-120

TLDR

Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics, and a human assisted approach based on LDA for extracting domain topics from source code is proposed.

Abstract:

One of the difficulties in maintaining a large software system is the absence of documented business domain topics and of a correlation between these domain topics and the source code. Without such a correlation, people with no prior knowledge of the application would find it hard to comprehend the functionality of the system. Latent Dirichlet Allocation (LDA), a statistical model, has emerged as a popular technique for discovering topics in large text corpora, but its applicability to extracting business domain topics from source code has not been explored so far. This paper investigates LDA in the context of comprehending large software systems and proposes a human-assisted approach based on LDA for extracting domain topics from source code. The method has been applied to a number of open source and proprietary systems. Preliminary results indicate that LDA is able to identify some of the domain topics and is a satisfactory starting point for further manual refinement of topics.
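As a rough illustration of the idea in the abstract, the sketch below treats each source file as a document whose words come from its identifiers, then fits LDA to those documents. It is a minimal sketch under stated assumptions, not the paper's implementation: the file names and identifiers are invented, collapsed Gibbs sampling is chosen here only because it is short to write, and the values of `alpha`, `beta`, `iters`, and `num_topics` are arbitrary.

```python
import random
import re
from collections import Counter


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in re.split(r"[^A-Za-z]+", name):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
    # Drop very short fragments such as "id" (an arbitrary filter).
    return [w.lower() for w in words if len(w) > 2]


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling; return the count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                     # tokens per topic
    assign = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical files and identifiers, invented for this example.
files = {
    "AccountService.java": ["getAccountBalance", "updateAccountBalance",
                            "closeAccount", "account_holder"],
    "LoanCalculator.java": ["computeLoanInterest", "loan_interest_rate",
                            "loanTerm", "interestRate"],
    "AccountReport.java": ["printAccountBalance", "account_summary"],
    "LoanApproval.java": ["approveLoan", "loan_interest_check"],
}
docs = [[w for name in names for w in split_identifier(name)]
        for names in files.values()]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)

for k, counts in enumerate(topic_word):
    top = [w for w, c in counts.most_common(4) if c > 0]
    print(f"topic {k}: {top}")
```

The printed word lists are only candidate topics. In the human-assisted approach the abstract describes, a person would still inspect, label, and refine them.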

Citations

Proceedings Article

Software traceability with topic modeling

Hazeline U. Asuncion, +2 more

TL;DR: Proposes an automated technique that combines traceability with topic modeling, a machine learning technique. It automatically records traceability links during the software development process and learns a probabilistic topic model over artifacts.

Journal Article

Bug localization using latent Dirichlet allocation

Stacy K. Lukins, +2 more

- 01 Sep 2010 -

Information & Software Technology

TL;DR: An effective static technique for automatic bug localization can be built around Latent Dirichlet allocation (LDA), and there is no significant relationship between the accuracy of the LDA-based technique and the size of the subject software system or the stability of its source code base.

Proceedings Article

Automatically capturing source code context of NL-queries for software maintenance and reuse

Emily Hill, +2 more

TL;DR: A novel approach is presented that automatically extracts natural language phrases from source code identifiers and categorizes the phrases and search results in a hierarchy and significantly outperforms the most closely related technique in terms of effort and effectiveness.

Proceedings Article

Source Code Retrieval for Bug Localization Using Latent Dirichlet Allocation

Stacy K. Lukins, +2 more

TL;DR: The authors present a static technique for bug localization based on the latent Dirichlet allocation (LDA) model, which has significant advantages over both LSI and probabilistic LSI.

Journal Article

What is wrong with topic modeling? And how to fix it using search-based software engineering

Amritanshu Agrawal, +2 more

- 01 Jun 2018 -

Information & Software Technology

TL;DR: LDADE, a search-based software engineering tool which uses Differential Evolution (DE) to tune the LDA’s parameters, is used to provide a method in which distributions generated by LDA are more stable and can be used for further analysis.

References

Journal Article

Latent dirichlet allocation

David M. Blei, +2 more

- 01 Mar 2003 -

Journal of Machine Learning Research

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.

Proceedings Article

Latent Dirichlet Allocation

David M. Blei, +2 more

TL;DR: This paper proposed a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).

Journal Article

Indexing by Latent Semantic Analysis

Scott Deerwester, +4 more

- 01 Sep 1990 -

Journal of the Association for Informati...

TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.

Journal Article

An algorithm for suffix stripping

M. F. Porter

- 01 Dec 1997 -

Program: Electronic Library and Informat...

TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.

Journal Article

Finding scientific topics

Thomas L. Griffiths, +1 more

- 06 Apr 2004 -

Proceedings of the National Academy of S...

TL;DR: A generative model for documents is described, introduced by Blei, Ng, and Jordan, and a Markov chain Monte Carlo algorithm is presented for inference in this model, which is used to analyze abstracts from PNAS by using Bayesian model selection to establish the number of topics.
