Journal ArticleDOI

Classifying Mobile Applications Using Word Embeddings

30 Apr 2022 - ACM Transactions on Software Engineering and Methodology (ACM, New York, NY) - Vol. 31, Iss. 2, pp. 1-30
TL;DR: In this paper, the authors present a word-embedding-based method for classifying apps into the generic categories, or genres, used by modern application stores, such as health, games, and music, which are typically static.
Abstract: Modern application stores enable developers to classify their apps by choosing from a set of generic categories, or genres, such as health, games, and music. These categories are typically static—n...
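A minimal sketch of the technique named in the title, assuming averaged pre-trained word vectors feeding an off-the-shelf classifier; the embedding model, toy app descriptions, and classifier choice below are illustrative assumptions, not the authors' actual pipeline:

import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Pre-trained 50-dimensional GloVe vectors (downloaded on first use).
word_vectors = api.load("glove-wiki-gigaword-50")

def embed(description):
    # Average the vectors of all in-vocabulary words in a description.
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

# Toy (app description, store category) pairs standing in for real data.
apps = [("track your daily runs and heart rate", "health"),
        ("match three puzzle adventure with many levels", "games"),
        ("stream playlists and discover new artists", "music")]
X = np.array([embed(d) for d, _ in apps])
y = [c for _, c in apps]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed("guided meditation and sleep sounds")]))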
Citations
Proceedings ArticleDOI
01 Mar 2022
TL;DR: This paper presents a new approach for SP estimation based on analysing textual features of software issues, employing latent Dirichlet allocation (LDA) to represent issues as topics and hierarchical clustering to agglomerate them into clusters based on their topic similarities.
Abstract: Automated techniques to estimate Story Points (SP) for user stories in agile software development came to the fore a decade ago. Yet, the state-of-the-art estimation techniques' accuracy has room for improvement. In this paper, we present a new approach for SP estimation, based on analysing textual features of software issues by employing latent Dirichlet allocation (LDA) and clustering. We first use LDA to represent issue reports in a new space of generated topics. We then use hierarchical clustering to agglomerate issues into clusters based on their topic similarities. Next, we build estimation models using the issues in each cluster. Then, we find the closest cluster to each newly arriving issue and use the model from that cluster to estimate the SP. Our approach is evaluated on a dataset of 26 open source projects with a total of 31,960 issues and compared against both baselines and state-of-the-art SP estimation techniques. The results show that the estimation performance of our proposed approach is as good as the state-of-the-art. However, none of these approaches is statistically significantly better than more naive estimators in all cases, which does not justify their additional complexity. We therefore encourage future work to develop alternative strategies for story points estimation. The experimental data and scripts we used in this work are publicly available to allow for replication and extension.
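The pipeline described above compresses to a few steps; the sketch below assumes toy issue texts, a tiny topic count, and a mean-SP cluster model, all stand-ins for the paper's actual setup of 26 projects and 31,960 issues:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

issues = ["fix login crash on session timeout",
          "add oauth login flow for mobile",
          "refactor report export module",
          "speed up report rendering pipeline"]
story_points = np.array([3, 5, 8, 5])

# 1) Represent issue reports in a space of generated topics via LDA.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(issues)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)

# 2) Agglomerate issues into clusters based on topic similarity.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(topics)

# 3) Build one estimator per cluster; here simply the cluster's mean SP.
centroids = {c: topics[labels == c].mean(axis=0) for c in set(labels)}
estimates = {c: story_points[labels == c].mean() for c in set(labels)}

# 4) Route an incoming issue to its closest cluster, reuse that model.
new = lda.transform(vectorizer.transform(["login crashes after password reset"]))[0]
nearest = min(centroids, key=lambda c: np.linalg.norm(new - centroids[c]))
print("estimated SP:", estimates[nearest])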

8 citations

Journal ArticleDOI
TL;DR: In this paper, the authors conduct an experimental simulation with 58 analysts and collect qualitative data to study strategies, benefits, and challenges of app store-inspired elicitation, and compare this technique with the more traditional requirements elicitation interviews.
Abstract: App store-inspired elicitation is the practice of exploring competitors’ apps, to get inspiration for requirements. This activity is common among developers, but little insight is available on its practical use, advantages and possible issues. This paper aims to study strategies, benefits, and challenges of app store-inspired elicitation, and to compare this technique with the more traditional requirements elicitation interviews. We conduct an experimental simulation with 58 analysts, and collect qualitative data. Our results show that: (1) specific guidelines and procedures are required to better conduct app store-inspired elicitation; (2) current search features made available by app stores are not suitable for this practice, and more tool support is required to help analysts in the retrieval and evaluation of competing products; (3) while interviews focus on the why dimension of requirements engineering (i.e., goals), app store-inspired elicitation focuses on how (i.e., solutions), offering indications for implementation and improved usability. Our study provides a framework for researchers to address existing challenges and suggests possible benefits to foster app store-inspired elicitation among practitioners.

1 citation

Journal ArticleDOI
TL;DR: An automated approach for annotating privacy policies in the DSE market is proposed to help DSE app developers to draft more comprehensible privacy policies as well as help their end-users to make more informed decisions in one of the fastest growing software ecosystems in the world.
Abstract: Applications (apps) of the Digital Sharing Economy (DSE), such as Uber, Airbnb, and TaskRabbit, have become a main enabler of economic growth and shared prosperity in modern-day societies. However, the complex exchange of goods, services, and data that takes place over these apps frequently puts their end-users’ privacy at risk. Privacy policies of DSE apps are provided to disclose how private user data is being collected and handled. However, in reality, such policies are verbose and difficult to understand, leaving DSE users vulnerable to privacy intrusive practices. To address these concerns, in this paper, we propose an automated approach for annotating privacy policies in the DSE market. Our approach identifies data collection claims in these policies and maps them to the quality features of their apps. Visual and textual annotations are then used to further explain and justify these claims. The proposed approach is evaluated with 18 DSE app users. The results show that annotating privacy policies can significantly enhance their comprehensibility to the average DSE user. Our findings are intended to help DSE app developers to draft more comprehensible privacy policies as well as help their end-users to make more informed decisions in one of the fastest growing software ecosystems in the world.
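As a rough illustration of the claim-identification step only (the paper's actual annotation approach, and its mapping of claims to app quality features, is not reproduced here), candidate data-collection sentences could be flagged with simple patterns:

import re

# Hand-written patterns standing in for a learned claim detector.
COLLECTION_PATTERNS = [
    r"\bwe (?:collect|gather|store|record)\b",
    r"\b(?:location|contact|payment|device) (?:data|information)\b",
]

def find_claims(policy_text):
    # Return sentences that look like data-collection claims.
    sentences = re.split(r"(?<=[.!?])\s+", policy_text)
    return [s for s in sentences
            if any(re.search(p, s, re.IGNORECASE) for p in COLLECTION_PATTERNS)]

policy = ("We collect location data to match you with nearby drivers. "
          "You can delete your account at any time.")
for claim in find_claims(policy):
    print("claim:", claim)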
Book ChapterDOI
01 Jan 2022
TL;DR: In this paper , the authors proposed an auto classify approach to classify the defects into impact categories as defined by Orthogonal Defect Classification (ODC), which is a popular model for classifying defects and it provides an in-depth analysis of the defects.
Abstract: Software systems have become an integral part of all organizations. These systems perform many critical operations. A defect in these systems affects product quality and the software development process. Predicting the impact category of these defects helps in improving the defect management process as well as in taking correct decisions to fix defects. Orthogonal Defect Classification (ODC) is a popular model for classifying defects, and it provides an in-depth analysis of the defects. In this study, we propose an auto-classify approach to classify defects into the impact categories defined by ODC. Bag of words, term frequency-inverse document frequency, and word embedding have been used to represent the textual data as numeric vectors. For the experimental work, we used 4,096 reports from three NoSQL databases. We trained and tested the proposed auto-classify approach using a Support Vector Machine (SVM) and a Random Forest Classifier (RFC), achieving maximum accuracies of 94% and 85.99%, respectively. Keywords: Orthogonal defect classification; Bag of words; Term frequency-inverse document frequency; Word embedding
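A sketch of the classification setup named above, with TF-IDF features feeding an SVM and a Random Forest; the defect reports and ODC impact labels below are invented stand-ins for the study's 4,096 real reports:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

reports = ["query returns stale rows after compaction",
           "node crashes when replica count drops below quorum",
           "writes become slow under heavy concurrent load",
           "rows silently lost after unclean shutdown"]
impacts = ["integrity", "reliability", "performance", "integrity"]

# Same TF-IDF features, two classifiers, as in the study's comparison.
for clf in (SVC(kernel="linear"), RandomForestClassifier(random_state=0)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(reports, impacts)
    print(type(clf).__name__, model.predict(["reads hang during compaction"]))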
References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
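A minimal taste of the API consistency this abstract emphasizes; every scikit-learn estimator follows the same fit/predict convention:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a bundled dataset, split it, fit an estimator, score it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))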

47,974 citations

Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, as a way of assessing the extent to which such judgments are reproducible, i.e., reliable.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and
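The statistic this paper introduces is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the judges' marginal totals; a toy two-judge example using scikit-learn's implementation (the category labels are invented):

from sklearn.metrics import cohen_kappa_score

# Two judges independently assign five units to k = 3 nominal categories.
judge_a = ["anxious", "depressed", "anxious", "other", "depressed"]
judge_b = ["anxious", "depressed", "other", "other", "depressed"]

# kappa corrects the raw agreement (here 4/5) for chance agreement.
print("kappa:", cohen_kappa_score(judge_a, judge_b))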

34,965 citations

Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
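A toy demonstration of the document-as-topic-mixture view described above, using gensim's variational implementation; the corpus and topic count are illustrative:

from gensim import corpora
from gensim.models import LdaModel

texts = [["game", "level", "score", "play"],
         ["doctor", "health", "symptom", "checkup"],
         ["score", "game", "health", "level"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each document is represented as a finite mixture over the learned topics.
for doc_id, bow in enumerate(corpus):
    print(doc_id, lda.get_document_topics(bow))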

30,570 citations

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A new global log-bilinear regression model combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
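The "meaningful substructure" claimed above can be probed with pre-trained GloVe vectors, here fetched through gensim's downloader, on the classic analogy task:

import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near queen in the vector space.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))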

30,558 citations

Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
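Mapped onto gensim's implementation (a tooling assumption; the paper's original code is the word2vec C tool), the extensions above correspond to sg=1 for skip-gram, negative=5 for negative sampling, sample for frequent-word subsampling, and Phrases for the phrase-finding method:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [["air", "canada", "flight", "was", "delayed"],
             ["air", "canada", "announced", "new", "routes"],
             ["fresh", "air", "across", "canada", "mountains"]]

# Phrase detection merges frequent collocations such as "air canada"
# into single tokens, following the paper's phrase-finding idea.
bigrams = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
phrased = [bigrams[s] for s in sentences]

# sg=1: skip-gram; negative=5: negative sampling; sample: subsampling.
model = Word2Vec(phrased, vector_size=50, sg=1, negative=5,
                 sample=1e-3, min_count=1, epochs=50, seed=0)
print("air_canada" in model.wv.key_to_index)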

24,012 citations