
Feature location in source code: a taxonomy and survey

TLDR
A systematic literature survey of feature location techniques is presented. Eighty-nine articles from 25 venues are reviewed and classified within a taxonomy in order to organize and structure existing work in the field of feature location.
Abstract
SUMMARY Feature location is the activity of identifying an initial location in the source code that implements functionality in a software system. Many feature location techniques have been introduced that automate some or all of this process, and a comprehensive overview of this large body of work would be beneficial to researchers and practitioners. This paper presents a systematic literature survey of feature location techniques. Eighty-nine articles from 25 venues have been reviewed and classified within the taxonomy in order to organize and structure existing work in the field of feature location. The paper also discusses open issues and defines future directions in the field of feature location. Copyright © 2011 John Wiley & Sons, Ltd.


CRC to Journal of Software Maintenance and Evolution: Research and Practice
Feature Location in Source Code:
A Taxonomy and Survey
Bogdan Dit, Meghan Revelle, Malcom Gethers, Denys Poshyvanyk
The College of William and Mary
________________________________________________________________________
Feature location is the activity of identifying an initial location in the source code that
implements functionality in a software system. Many feature location techniques have
been introduced that automate some or all of this process, and a comprehensive overview
of this large body of work would be beneficial to researchers and practitioners. This
paper presents a systematic literature survey of feature location techniques. Eighty-nine
articles from 25 venues have been reviewed and classified within the taxonomy in order
to organize and structure existing work in the field of feature location. The paper also
discusses open issues and defines future directions in the field of feature location.
Keywords: Feature location, concept location, program comprehension, software
maintenance and evolution
________________________________________________________________________
1. INTRODUCTION
In software systems, a feature represents a functionality that is defined by requirements
and accessible to developers and users. Software maintenance and evolution involves
adding new features to programs, improving existing functionalities, and removing bugs,
which is analogous to removing unwanted functionalities. Identifying an initial location
in the source code that corresponds to a specific functionality is known as feature (or
concept) location [Biggerstaff'94, Rajlich'02]. It is one of the most frequent maintenance
activities undertaken by developers because it is a part of the incremental change process
[Rajlich'04]. During the incremental change process, programmers use feature location
to find where in the code the first change to complete a task needs to be made. The full
extent of the change is then handled by impact analysis, which starts with the source code
identified by feature location and finds all the code affected by the change.
Methodologically, the two activities of feature location and impact analysis are different
and are treated separately in the literature and in this survey.
Feature location is one of the most important and common activities performed by
programmers during software maintenance and evolution. No maintenance activity can
be completed without first locating the code that is relevant to the task at hand, making
feature location essential to software maintenance since it is performed in the context of
incremental change. For example, Alice is a new developer on a software project, and
her manager has given her the task of fixing a bug that has been recently reported. Since
Alice is new to this project, she is unfamiliar with the large code base of the software
system and does not know where to begin. Lacking sufficient documentation on the
system and the ability to ask the code’s original authors for help, the only option Alice
sees is to manually search for the code relevant to her task.
Alice’s situation is one faced by many software developers needing to understand and
modify an unfamiliar codebase. However, a manual search of a large amount of source
code, even with the help of tools such as pattern matchers or an integrated development
environment, can be frustrating and time-consuming. Recognizing this problem,
software engineering researchers have developed a number of feature location techniques
(FLTs) to aid programmers in Alice’s position. The various techniques that have
been introduced are all unique in terms of their input requirements, how they locate a
feature’s implementation, and how they present their results. Thus, even the task of
choosing a suitable feature location technique can be challenging.
The existence of such a large body of feature location research calls for a
comprehensive overview. Since there currently is no broad summary of the field of
feature location, this paper provides a systematic survey and operational taxonomy of this
pertinent research area. To the best of our knowledge, the only other survey is by
Wilde et al. [Wilde'03], which, in contrast to ours, compares only a few feature location
techniques. Our survey includes research articles that introduce new feature location
approaches; case, industrial, and user studies; and tools that can be used in support of
feature location. The articles are characterized within a taxonomy that has nine
dimensions, and each dimension has a set of attributes associated with it. The dimensions
and attributes of the taxonomy capture key facets of typical feature location techniques
and can be useful to both software engineering researchers and practitioners
[Marcus'05b]. Researchers can use this survey to identify what has been done in the area
of feature location and what needs to be done; that is, they can use it to find related work
as well as opportunities for future research. Practitioners can use this overview to
determine which feature location approach is most suited to their needs.
This survey encompasses 89 articles (60 research articles and 29 tool and case study
papers) from 25 venues published between November 1992 and February 2011. These
research articles were selected because they either state feature/concept location as their
goal or present a technique that is essentially equivalent to feature location. The tool
papers include tools developed specifically for feature location as well as program
exploration tools that support feature location. The case study articles include industrial
and user studies as well as studies that compare existing approaches.
There are several research areas that are closely related to feature location, such as
traceability link recovery, impact analysis, and aspect mining. Traceability link recovery
seeks to connect different types of software artifacts (e.g., documentation with source
code), while feature location is more concerned with identifying source code associated
with functionalities, not specific sections of a document. Impact analysis is the step in
the incremental change process performed after feature location with the purpose of
expanding on feature location’s results, especially after a change is made to the source
code. Feature location focuses on finding the starting point for that change. The main
goal of aspect mining is to identify cross-cutting concerns and determine the source code
that should be refactored into aspects, meaning the aspects themselves are not known a
priori. By contrast, in the contexts in which feature location is used, the high-level
descriptions of features are already known and only the code that implements them is
unknown. Therefore, articles and research from these related fields are not included here
as they are beyond the scope of this focused survey.
The work presented in this paper has two main contributions. The first is a systematic
survey of feature location techniques, relevant case studies, and tools. The second is the
taxonomy derived from those techniques. An online appendix¹ lists all of the surveyed
articles classified within the taxonomy. Section 2 presents the systematic review process.
Section 3 introduces the dimensions of the taxonomy, and Section 4 provides brief
descriptions of the surveyed approaches. Section 5 overviews the feature location tools
and studies, and Section 6 provides an analysis of the taxonomy. Section 7 discusses open
issues in feature location and Section 8 concludes.
¹ http://www.cs.wm.edu/semeru/data/feature-location-survey/ (accessed and verified on 03/01/2011)
2. SYSTEMATIC REVIEW PROCESS
In this paper we perform a systematic survey of the feature location literature in order
to address the following research questions (RQ):
RQ1: What types of analysis are used while performing feature location?
RQ2: Has there been a change in types of analysis used to identify features in source
code employed by recent feature location techniques?
RQ3: Are there any limitations to current strategies for evaluating various feature
location techniques?
In order to answer these research questions, we conducted a systematic review of the
literature using the following process (see Figure 1):
Search: the initial set of articles to be considered during the selection process is
determined by identifying pertinent journals, conferences and workshops.
Article Selection: using inclusion and exclusion criteria the initial set of articles
is filtered and only relevant articles are considered beyond this step.
Article Characterization: articles that meet the selection criteria are
classified according to the set of attributes that capture important characteristics
of feature location techniques.
Analysis: using the resulting taxonomy and systematic classification of the
papers, the research questions are answered and useful insights about the state of
feature location research and practice are outlined.
2.1. Search
An initial subset of papers of interest was obtained by manually evaluating articles that
appear in different venues considered during our preliminary exploration. We selected
venues whose scope includes feature location research. Also, choosing
such venues ensures that selected articles meet some standard (e.g., the papers went
through a rigorous peer review process).
2.2. Article Selection
To adhere to the properties of systematic reviews [Kitchenham'04] we define the
following inclusion and exclusion criteria. In order to be included in the survey, a paper
must introduce, evaluate, and/or complement the implementation of a source code based
feature location technique. This includes papers that introduce novel feature location
techniques, evaluate various existing feature location techniques, or present tools
implementing existing or new approaches to feature location. Papers that focused
on improving the performance of underlying analysis techniques (e.g., dynamic analysis,
Information Retrieval), as opposed to the feature location process itself, were excluded.
2.3. Article Classification
The authors read and categorized each article according to the taxonomy and the
attributes presented in Section 3. Each of the four authors classified the articles
individually. Using the initial classifications produced by the authors, we
identified papers on which the authors disagreed and discussed those papers further. The
set of attributes was extracted and defined by two of the authors. Having all four authors
characterize the articles allowed us to verify the quality of the taxonomy and to minimize
potential bias. In certain cases, disagreements served as an indication that our taxonomy
and attributes or their corresponding descriptions required refinement. Through this
process we were able to improve the quality of our taxonomy and attribute set as well as
improve their descriptions.
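The step of identifying papers with disagreements can be pictured with a minimal Python sketch. It assumes that each rater's classification is stored as a nested mapping from article to dimension to a set of attributes; the function name `find_disagreements`, the rater and article identifiers, and the data layout are hypothetical and are not taken from the survey's actual tooling.

```python
def find_disagreements(labels_by_rater):
    """Return (article, dimension) pairs on which the raters' labels differ.

    labels_by_rater maps a rater name to {article: {dimension: set_of_attributes}}.
    This layout is an assumption made for illustration only.
    """
    # Collect every article that any rater classified.
    articles = set()
    for labels in labels_by_rater.values():
        articles.update(labels)

    disagreements = []
    for article in sorted(articles):
        # Collect every dimension that any rater filled in for this article.
        dimensions = set()
        for labels in labels_by_rater.values():
            dimensions.update(labels.get(article, {}))

        for dimension in sorted(dimensions):
            # A missing entry counts as an empty attribute set.
            answers = {
                frozenset(labels.get(article, {}).get(dimension, ()))
                for labels in labels_by_rater.values()
            }
            if len(answers) > 1:
                disagreements.append((article, dimension))
    return disagreements


if __name__ == "__main__":
    # Hypothetical raters and articles, used only to exercise the function.
    labels = {
        "rater_1": {"P1": {"analysis": {"dynamic"}}, "P2": {"analysis": {"textual"}}},
        "rater_2": {"P1": {"analysis": {"dynamic"}}, "P2": {"analysis": {"textual", "static"}}},
    }
    print(find_disagreements(labels))
```

Pairs flagged in this way would then be the ones set aside for discussion, which is also where unclear attribute descriptions would tend to surface.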
2.4. Analysis
Following the classification of the research papers, our final step includes an analysis of the
results, answers to the research questions, and an outline of future directions for
researchers and practitioners investigating feature location techniques. In order to
complete this step we analyzed the trends in our resulting taxonomy and observed
interesting co-occurrences of various attributes across feature location techniques. We
also investigated characteristics that rarely apply to the set of techniques considered as
well as characteristics which are currently emerging in the research literature.
3. DIMENSIONS OF THE SURVEY
The goal of this survey is to provide researchers and practitioners with a structured
overview of existing research in the area of feature location. From a methodical
inspection of the research literature we extracted a number of key dimensions². These
dimensions objectively describe different techniques and offer structure to the surveyed
literature. The dimensions are as follows:
The type of analysis: What underlying analyses are used to support feature location?
The type of user input: What does a developer have to provide as an input to the feature location technique?
Data sources: What derivative artifacts have to be provided as an input for the feature location technique?
Output: What types of results are produced, and how are they presented to the user?
Programming language support: On which programming languages was this technique instantiated?
The evaluation of the approach: How was this feature location technique evaluated?
Systems evaluated: What are the systems that were used in the evaluation?
The order in which these dimensions are presented does not imply any explicit priority or
importance.
² Some of these dimensions were discussed at the working session on Information Retrieval Approaches in Software Evolution at the 22nd IEEE International Conference on Software Maintenance (ICSM’06): http://www.cs.wayne.edu/~amarcus/icsm2006/
Figure 1. Systematic review process
Each dimension has a number of distinct attributes associated with it. For a given
dimension, a feature location technique may be associated with multiple attributes. These
dimensions and their attributes were derived by examining an initial set of articles of
interest. They were then refined and generalized to succinctly characterize the properties
that make feature location techniques unique, and can be used to evaluate and compare
them. The goal of the taxonomy’s dimensions and attributes is to allow researchers and
practitioners to easily locate the feature location techniques that are most suited to their
needs. The dimensions and their associated attributes that are used in the taxonomy of
the surveyed articles are listed in Table 1. These dimensions and attributes are discussed
in the remainder of this section. The attributes are highlighted in italics.
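As an illustration of how a classification within such a taxonomy could be represented and queried, consider the following minimal Python sketch. The `Classification` record, the `find_matching` helper, the dimension keys, and the placeholder article names are hypothetical; only the attribute names dynamic, static, and textual are taken from the text of this survey.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Classification:
    """One surveyed article with its attributes, grouped by taxonomy dimension."""
    article: str
    # A technique may carry several attributes for the same dimension.
    attributes: Dict[str, Set[str]] = field(default_factory=dict)


def find_matching(classifications: List[Classification], **wanted: str) -> List[str]:
    """Return the articles that have every requested attribute.

    Each keyword argument names a dimension and the attribute required for it.
    """
    matches = []
    for c in classifications:
        if all(value in c.attributes.get(dim, set()) for dim, value in wanted.items()):
            matches.append(c.article)
    return matches


if __name__ == "__main__":
    # Placeholder entries, not classifications of real articles.
    catalog = [
        Classification("Technique A", {"analysis": {"dynamic"}}),
        Classification("Technique B", {"analysis": {"dynamic", "textual"},
                                       "language": {"Java"}}),
    ]
    print(find_matching(catalog, analysis="textual", language="Java"))
```

Storing a set of attributes per dimension reflects the point made above that a technique may be associated with multiple attributes of the same dimension.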
3.1. Type of Analysis
A main distinguishing factor of feature location techniques is the type, or types, of
analysis they employ to identify the code that pertains to a feature. The most common
types of analyses include dynamic, static, and textual. While these are not the only types
of analysis possible, they are the ones utilized by the vast majority of feature location
techniques, and some approaches even leverage more than one of these types of analysis.
In Section 4, descriptions of all the surveyed articles are given, and the section is
organized by the type(s) of analysis used.
Dynamic analysis refers to examining a software system’s execution, and it is often
used for feature location when features can be invoked and observed during runtime.
Feature location using dynamic analysis generally relies on a post-mortem analysis of an
execution trace. Typically, one or more feature-specific scenarios are developed that
invoke only the desired feature. Then, the scenarios are run and execution traces are
collected, recording information about the code that was invoked. These traces are
captured either by instrumenting the system or through profiling. Once the traces are
obtained, feature location can be performed in several ways. The traces can be compared
to other traces in which the feature was not invoked to find code only invoked in the
feature-specific traces [Eisenbarth'03, Wilde'95]. Alternatively, the execution
frequency of portions of code can be analyzed to locate a feature’s implementation
[Antoniol'06, Eisenberg'05, Safyallah'06]. Using dynamic analysis for feature location is
a popular choice since most features can be mapped to execution scenarios. However,
there are some limitations associated with dynamic analysis. The collection of traces can
impose considerable overhead on a system’s execution. Additionally, the scenarios used
to collect traces may not invoke all of the code that is relevant to the feature, meaning
that some of the feature’s implementation may not be located. Conversely, it may be
difficult to formulate a scenario that invokes only the desired feature, causing irrelevant
code to be executed. Dynamic feature location techniques are discussed in Section 4.2.
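The two trace-based strategies described above can be illustrated with a minimal Python sketch. It assumes that each execution trace has already been reduced to a list of invoked method names. The function names, the sample method names, and the relative-frequency score are illustrative assumptions and do not reproduce the exact algorithms of the cited techniques.

```python
from collections import Counter


def feature_specific_methods(feature_traces, other_traces):
    """Return methods invoked in every feature trace and in no other trace."""
    if not feature_traces:
        return set()
    in_all_feature = set.intersection(*(set(t) for t in feature_traces))
    in_any_other = set().union(*(set(t) for t in other_traces))
    return in_all_feature - in_any_other


def rank_by_relative_frequency(feature_traces, other_traces):
    """Rank methods by the share of their invocations that occur in feature traces.

    The score is a simple illustrative heuristic, not a published metric.
    """
    feature_counts = Counter(m for t in feature_traces for m in t)
    other_counts = Counter(m for t in other_traces for m in t)
    scores = {m: c / (c + other_counts[m]) for m, c in feature_counts.items()}
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda m: (-scores[m], m))


if __name__ == "__main__":
    # Hypothetical traces: the first group exercises a spell-checking feature.
    with_feature = [
        ["main", "open_file", "parse", "spell_check", "render"],
        ["main", "spell_check", "suggest", "render"],
    ]
    without_feature = [
        ["main", "open_file", "parse", "render"],
        ["main", "render", "render"],
    ]
    print(feature_specific_methods(with_feature, without_feature))
    print(rank_by_relative_frequency(with_feature, without_feature))
```

The set-difference variant yields a small, precise candidate set but misses code that a scenario did not exercise, whereas the ranking variant keeps every executed method and leaves the cut-off to the developer. Both remain subject to the scenario-quality limitations noted above.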
Static analysis examines structural information such as control or data flow
dependencies. In manual feature location, developers may follow program dependencies
in a section of code they deem to be relevant in order to find additional useful code, and
