Open Access · Posted Content

Topic Modeling on User Stories using Word Mover's Distance

TLDR
In this article, the authors focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance.
Abstract
Requirements elicitation has recently been complemented with crowd-based techniques, which continuously involve large, heterogeneous groups of users who express their feedback through a variety of media. Crowd-based elicitation has great potential for engaging with (potential) users early on but also results in large sets of raw and unstructured feedback. Consolidating and analyzing this feedback is a key challenge for turning it into sensible user requirements. In this paper, we focus on topic modeling as a means to identify topics within a large set of crowd-generated user stories and compare three approaches: (1) a traditional approach based on Latent Dirichlet Allocation, (2) a combination of word embeddings and principal component analysis, and (3) a combination of word embeddings and Word Mover's Distance. We evaluate the approaches on a publicly available set of 2,966 user stories written and categorized by crowd workers. We found that a combination of word embeddings and Word Mover's Distance is most promising. Depending on the word embeddings we use in our approaches, we manage to cluster the user stories in two ways: one that is closer to the original categorization and another that allows new insights into the dataset, e.g. to find potentially new categories. Unfortunately, no measure exists to rate the quality of our results objectively. Still, our findings provide a basis for future work towards analyzing crowd-sourced user stories.
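
The most promising approach, (3), can be illustrated with a short sketch: compute pairwise Word Mover's Distances between user stories over pretrained word embeddings, then cluster the resulting distance matrix. This is a minimal illustration under assumptions, not the authors' implementation; it uses gensim's wmdistance (which requires the POT package) with pretrained Google News vectors, scikit-learn's agglomerative clustering, and placeholder user stories.

```python
# Minimal sketch of approach (3): word embeddings + Word Mover's Distance.
# Illustrative assumptions: gensim with pretrained Google News vectors,
# scikit-learn >= 1.2, and toy user stories; not the authors' pipeline.
import numpy as np
import gensim.downloader as api
from sklearn.cluster import AgglomerativeClustering

stories = [  # placeholder user stories
    "as a user i want to filter search results by date",
    "as a user i want to sort search results by relevance",
    "as an admin i want to export a report as csv",
    "as an admin i want to download usage statistics",
]

vectors = api.load("word2vec-google-news-300")  # pretrained embeddings

docs = [s.split() for s in stories]
n = len(docs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # WMD: the minimum cumulative distance the words of one story
        # must "travel" in embedding space to match the other story.
        dist[i, j] = dist[j, i] = vectors.wmdistance(docs[i], docs[j])

labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(dist)
print(labels)  # cluster index per user story
```

Because WMD is defined between documents rather than in a fixed vector space, the distances are precomputed and handed to the clustering step as a matrix.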


Citations
Journal Article (DOI)

The Use of NLP-Based Text Representation Techniques to Support Requirement Engineering Tasks: A Systematic Mapping Review

TL;DR: A survey in the form of a systematic literature mapping (classification) finds out which text representations are used in the RE-task literature, identifies four gaps in the existing literature, explains why they matter, and outlines how future research can begin to address them.
Journal Article (DOI)

Automatic Creation of Acceptance Tests by Extracting Conditionals from Requirements: NLP Approach and Case Study

TL;DR: CiRA is a tool-supported approach for automatically deriving test cases from conditional statements in informal requirements; it is capable of creating the minimal set of required test cases.
Proceedings Article (DOI)

Identification of Intra-Domain Ambiguity using Transformer-based Machine Learning

TL;DR: This work proposes an approach based on Bidirectional Encoder Representations from Transformers (BERT) and clustering, and shows that it is very effective in detecting intra-domain ambiguities.
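
The TL;DR names a general recipe (contextual BERT embeddings plus clustering) that can be sketched briefly. The following is a generic illustration of that idea, not the cited paper's implementation; it assumes HuggingFace transformers and scikit-learn, and uses two hypothetical requirement sentences sharing an ambiguous term.

```python
# Generic sketch: embed occurrences of a shared term with BERT, then
# cluster the contexts; separated clusters hint at diverging senses.
# Illustrative only; not the cited paper's implementation.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [  # hypothetical requirements using the same term, "log"
    "The system shall log every failed transaction.",
    "The saw station shall cut each log to the configured length.",
]

embeddings = []
for s in sentences:
    inputs = tokenizer(s, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())  # mean pooling

labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.stack(embeddings))
print(labels)
```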
Journal Article (DOI)

TABASCO: A transformer based contextualization toolkit

TL;DR: TABASCO is a tool for detecting intra-domain ambiguity in software requirements and other project-related documents written in natural language (NL), using BERT as a language model.
References
Journal Article (DOI)

Latent Dirichlet Allocation

TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
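
As a concrete illustration of the model the TL;DR describes, here is a minimal LDA run; a toy corpus and gensim stand in for real data, and none of this is taken from the cited paper.

```python
# Minimal LDA sketch with gensim: documents are mixtures of latent
# topics, and each topic is a distribution over the vocabulary.
from gensim import corpora
from gensim.models import LdaModel

texts = [  # placeholder tokenized documents
    ["user", "login", "password", "account"],
    ["report", "export", "csv", "download"],
    ["account", "password", "reset", "email"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)  # top words per inferred topic
```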
Journal Article

Visualizing Data using t-SNE

TL;DR: A new technique called t-SNE visualizes high-dimensional data by giving each datapoint a location in a two- or three-dimensional map. It is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map.
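
A minimal usage sketch of t-SNE, assuming scikit-learn and random placeholder vectors in place of real embeddings:

```python
# Project high-dimensional vectors to 2-D for plotting with t-SNE.
# Random data stands in for document or word embeddings.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))  # placeholder: 100 vectors, 300 dims

# perplexity balances local vs. global structure; it must stay below
# the number of samples.
tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
coords = tsne.fit_transform(X)  # shape (100, 2), ready for a scatter plot
```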
Proceedings Article

Latent Dirichlet Allocation

TL;DR: This paper proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models, including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model, also known as probabilistic latent semantic indexing (pLSI).
Proceedings Article

Distributed Representations of Words and Phrases and their Compositionality

TL;DR: This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
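
Both ideas in this TL;DR (corpus-statistics phrase detection and skip-gram with negative sampling) have off-the-shelf implementations in gensim; the following sketch uses a repeated toy corpus and hand-picked thresholds purely for illustration.

```python
# Sketch: detect frequent bigrams as phrases, then train skip-gram
# with negative sampling over the phrased corpus. Toy data only.
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [  # placeholder tokenized corpus
    ["new", "york", "is", "a", "city"],
    ["i", "visited", "new", "york", "last", "year"],
    ["machine", "learning", "is", "fun"],
] * 50  # repeat so the toy counts clear the frequency thresholds

# Pairs that co-occur far more often than chance become single tokens,
# e.g. "new_york"; the threshold is set low for this tiny corpus.
bigrams = Phraser(Phrases(sentences, min_count=5, threshold=0.05))
phrased = [bigrams[s] for s in sentences]

# sg=1 selects skip-gram; negative=5 draws five noise words per positive
# example instead of evaluating a full (hierarchical) softmax.
model = Word2Vec(phrased, vector_size=100, sg=1, negative=5,
                 min_count=1, epochs=10)
print(model.wv.most_similar("new_york", topn=3))
```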
Posted Content

Efficient Estimation of Word Representations in Vector Space

TL;DR: This paper proposes two novel model architectures for computing continuous vector representations of words from very large data sets; the quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks.
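
The word-similarity evaluation mentioned here can be reproduced in miniature with gensim's pretrained vectors; the vector set below is an arbitrary choice for illustration.

```python
# Sketch of the analogy/similarity test: vector("king") - vector("man")
# + vector("woman") should land near vector("queen"). Illustrative only.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # any pretrained set works
print(vectors.most_similar(positive=["king", "woman"],
                           negative=["man"], topn=3))
```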