Open Access · Posted Content

Topic Modeling on User Stories using Word Mover's Distance

TLDR
In this article, the authors focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance.
Abstract
Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.
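
The most promising approach, (3), can be illustrated with a short sketch: compute pairwise Word Mover's Distances between user stories over pretrained word embeddings, then cluster the resulting distance matrix. This is a minimal illustration under assumptions, not the authors' implementation; it uses gensim's wmdistance (which requires the POT package) with pretrained Google News vectors, scikit-learn's agglomerative clustering, and placeholder user stories.

```python
# Minimal sketch of approach (3): word embeddings + Word Mover's Distance.
# Illustrative assumptions: gensim with pretrained Google News vectors,
# scikit-learn >= 1.2, and toy user stories; not the authors' pipeline.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import AgglomerativeClustering

stories = [  # placeholder user stories
    "as a user i want to filter search results by date",
    "as a user i want to sort search results by relevance",
    "as an admin i want to export a report as csv",
    "as an admin i want to download usage statistics",
]

vectors = api.load("word2vec-google-news-300")  # pretrained embeddings

docs = [s.split() for s in stories]
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # WMD: the minimum cumulative distance the words of one story
        # must "travel" in embedding space to match the other story.
        dist[i, j] = dist[j, i] = vectors.wmdistance(docs[i], docs[j])

labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(dist)
print(labels)  # cluster index per user story
```

Because WMD is defined between documents rather than in a fixed vector space, the distances are precomputed and handed to the clustering step as a matrix.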


Citations
Journal Article (DOI)

The Use of NLP-Based Text Representation Techniques to Support Requirement Engineering Tasks: A Systematic Mapping Review

TL;DR: A survey in the form of a systematic literature mapping (classification) finds out which text representations are used in the RE-task literature, identifies four gaps in the existing literature, explains why they matter, and outlines how future research can begin to address them.
Journal Article (DOI)

Automatic Creation of Acceptance Tests by Extracting Conditionals from Requirements: NLP Approach and Case Study

TL;DR: CiRA is a tool-supported approach for automatically deriving test cases from conditional statements in informal requirements; it is capable of creating the minimal set of required test cases.
Proceedings Article (DOI)

Identification of Intra-Domain Ambiguity using Transformer-based Machine Learning

TL;DR: This work proposes an approach based on Bidirectional Encoder Representations from Transformers (BERT) and clustering, and shows that it is very effective in detecting intra-domain ambiguities.
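
The TL;DR names a general recipe (contextual BERT embeddings plus clustering) that can be sketched briefly. The following is a generic illustration of that idea, not the cited paper's implementation; it assumes HuggingFace transformers and scikit-learn, and uses two hypothetical requirement sentences sharing an ambiguous term.

```python
# Generic sketch: embed occurrences of a shared term with BERT, then
# cluster the contexts; separated clusters hint at diverging senses.
# Illustrative only; not the cited paper's implementation.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [  # hypothetical requirements using the same term, "log"
    "The system shall log every failed transaction.",
    "The saw station shall cut each log to the configured length.",
]

embeddings = []
for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean pooling

labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.stack(embeddings))
print(labels)
```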
Journal Article (DOI)

TABASCO: A transformer based contextualization toolkit

TL;DR: TABASCO is a tool for detecting intra-domain ambiguity in software requirements and other project-related documents written in natural language (NL), using BERT as a language model.
References
Journal Article (DOI)

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
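
As a concrete illustration of the model the TL;DR describes, here is a minimal LDA run; a toy corpus and gensim stand in for real data, and none of this is taken from the cited paper.

```python
# Minimal LDA sketch with gensim: documents are mixtures of latent
# topics, and each topic is a distribution over the vocabulary.
from gensim import corpora
from gensim.models import LdaModel

texts = [  # placeholder tokenized documents
    ["user", "login", "password", "account"],
    ["report", "export", "csv", "download"],
    ["account", "password", "reset", "email"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top words per inferred topic
```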
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
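
A minimal usage sketch of t-SNE, assuming scikit-learn and random placeholder vectors in place of real embeddings:

```python
# Project high-dimensional vectors to 2-D for plotting with t-SNE.
# Random data stands in for document or word embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))  # placeholder: 100 vectors, 300 dims

# perplexity balances local vs. global structure; it must stay below
# the number of samples.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
coords = tsne.fit_transform(X)  # shape (100, 2), ready for a scatter plot
```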
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
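
Both ideas in this TL;DR (corpus-statistics phrase detection and skip-gram with negative sampling) have off-the-shelf implementations in gensim; the following sketch uses a repeated toy corpus and hand-picked thresholds purely for illustration.

```python
# Sketch: detect frequent bigrams as phrases, then train skip-gram
# with negative sampling over the phrased corpus. Toy data only.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [  # placeholder tokenized corpus
    ["new", "york", "is", "a", "city"],
    ["i", "visited", "new", "york", "last", "year"],
    ["machine", "learning", "is", "fun"],
] * 50  # repeat so the toy counts clear the frequency thresholds

# Pairs that co-occur far more often than chance become single tokens,
# e.g. "new_york"; the threshold is set low for this tiny corpus.
bigrams = Phraser(Phrases(sentences, min_count=5, threshold=0.05))
phrased = [bigrams[s] for s in sentences]

# sg=1 selects skip-gram; negative=5 draws five noise words per positive
# example instead of evaluating a full (hierarchical) softmax.
model = Word2Vec(phrased, vector_size=100, sg=1, negative=5,
                 min_count=1, epochs=10)
print(model.wv.most_similar("new_york", topn=3))
```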
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.
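
The word-similarity evaluation mentioned here can be reproduced in miniature with gensim's pretrained vectors; the vector set below is an arbitrary choice for illustration.

```python
# Sketch of the analogy/similarity test: vector("king") - vector("man")
# + vector("woman") should land near vector("queen"). Illustrative only.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained set works
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```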