Journal ArticleDOI

Classifying Mobile Applications Using Word Embeddings

30 Apr 2022 - ACM Transactions on Software Engineering and Methodology (ACM, New York, NY) - Vol. 31, Iss. 2, pp. 1-30
TL;DR: In this paper, the authors present a word-embedding-based method for classifying apps into the generic categories, or genres, used by modern application stores, such as health, games, and music, which are typically static.
Abstract: Modern application stores enable developers to classify their apps by choosing from a set of generic categories, or genres, such as health, games, and music. These categories are typically static—n...
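A minimal sketch of the technique named in the title, assuming averaged pre-trained word vectors feeding an off-the-shelf classifier; the embedding model, toy app descriptions, and classifier choice below are illustrative assumptions, not the authors' actual pipeline:

import numpy as np
import gensim.downloader as api
from sklearn.linear_model import LogisticRegression

# Pre-trained 50-dimensional GloVe vectors (downloaded on first use).
word_vectors = api.load("glove-wiki-gigaword-50")

def embed(description):
    # Average the vectors of all in-vocabulary words in a description.
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        return np.zeros(word_vectors.vector_size)
    return np.mean([word_vectors[t] for t in tokens], axis=0)

# Toy (app description, store category) pairs standing in for real data.
apps = [("track your daily runs and heart rate", "health"),
        ("match three puzzle adventure with many levels", "games"),
        ("stream playlists and discover new artists", "music")]
X = np.array([embed(d) for d, _ in apps])
y = [c for _, c in apps]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed("guided meditation and sleep sounds")]))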
Citations
Proceedings ArticleDOI
01 Mar 2022
TL;DR: This paper presents a new approach for SP estimation based on analysing textual features of software issues, employing latent Dirichlet allocation (LDA) to represent issues as topics and hierarchical clustering to agglomerate them into clusters based on their topic similarities.
Abstract: Automated techniques to estimate Story Points (SP) for user stories in agile software development came to the fore a decade ago. Yet, the state-of-the-art estimation techniques' accuracy has room for improvement. In this paper, we present a new approach for SP estimation, based on analysing textual features of software issues by employing latent Dirichlet allocation (LDA) and clustering. We first use LDA to represent issue reports in a new space of generated topics. We then use hierarchical clustering to agglomerate issues into clusters based on their topic similarities. Next, we build estimation models using the issues in each cluster. Then, we find the closest cluster to each newly arriving issue and use the model from that cluster to estimate the SP. Our approach is evaluated on a dataset of 26 open source projects with a total of 31,960 issues and compared against both baselines and state-of-the-art SP estimation techniques. The results show that the estimation performance of our proposed approach is as good as the state-of-the-art. However, none of these approaches is statistically significantly better than more naive estimators in all cases, which does not justify their additional complexity. We therefore encourage future work to develop alternative strategies for story points estimation. The experimental data and scripts we used in this work are publicly available to allow for replication and extension.
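The pipeline described above compresses to a few steps; the sketch below assumes toy issue texts, a tiny topic count, and a mean-SP cluster model, all stand-ins for the paper's actual setup of 26 projects and 31,960 issues:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import AgglomerativeClustering

issues = ["fix login crash on session timeout",
          "add oauth login flow for mobile",
          "refactor report export module",
          "speed up report rendering pipeline"]
story_points = np.array([3, 5, 8, 5])

# 1) Represent issue reports in a space of generated topics via LDA.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(issues)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)

# 2) Agglomerate issues into clusters based on topic similarity.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(topics)

# 3) Build one estimator per cluster; here simply the cluster's mean SP.
centroids = {c: topics[labels == c].mean(axis=0) for c in set(labels)}
estimates = {c: story_points[labels == c].mean() for c in set(labels)}

# 4) Route an incoming issue to its closest cluster, reuse that model.
new = lda.transform(vectorizer.transform(["login crashes after password reset"]))[0]
nearest = min(centroids, key=lambda c: np.linalg.norm(new - centroids[c]))
print("estimated SP:", estimates[nearest])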

8 citations

Journal ArticleDOI
TL;DR: In this paper, the authors conduct an experimental simulation with 58 analysts and collect qualitative data to study strategies, benefits, and challenges of app store-inspired elicitation, and compare this technique with the more traditional requirements elicitation interviews.
Abstract: App store-inspired elicitation is the practice of exploring competitors’ apps, to get inspiration for requirements. This activity is common among developers, but little insight is available on its practical use, advantages and possible issues. This paper aims to study strategies, benefits, and challenges of app store-inspired elicitation, and to compare this technique with the more traditional requirements elicitation interviews. We conduct an experimental simulation with 58 analysts, and collect qualitative data. Our results show that: (1) specific guidelines and procedures are required to better conduct app store-inspired elicitation; (2) current search features made available by app stores are not suitable for this practice, and more tool support is required to help analysts in the retrieval and evaluation of competing products; (3) while interviews focus on the why dimension of requirements engineering (i.e., goals), app store-inspired elicitation focuses on how (i.e., solutions), offering indications for implementation and improved usability. Our study provides a framework for researchers to address existing challenges and suggests possible benefits to foster app store-inspired elicitation among practitioners.

1 citation

Journal ArticleDOI
TL;DR: An automated approach for annotating privacy policies in the DSE market is proposed to help DSE app developers to draft more comprehensible privacy policies as well as help their end-users to make more informed decisions in one of the fastest growing software ecosystems in the world.
Abstract: Applications (apps) of the Digital Sharing Economy (DSE), such as Uber, Airbnb, and TaskRabbit, have become a main enabler of economic growth and shared prosperity in modern-day societies. However, the complex exchange of goods, services, and data that takes place over these apps frequently puts their end-users’ privacy at risk. Privacy policies of DSE apps are provided to disclose how private user data is being collected and handled. However, in reality, such policies are verbose and difficult to understand, leaving DSE users vulnerable to privacy intrusive practices. To address these concerns, in this paper, we propose an automated approach for annotating privacy policies in the DSE market. Our approach identifies data collection claims in these policies and maps them to the quality features of their apps. Visual and textual annotations are then used to further explain and justify these claims. The proposed approach is evaluated with 18 DSE app users. The results show that annotating privacy policies can significantly enhance their comprehensibility to the average DSE user. Our findings are intended to help DSE app developers to draft more comprehensible privacy policies as well as help their end-users to make more informed decisions in one of the fastest growing software ecosystems in the world.
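As a rough illustration of the claim-identification step only (the paper's actual annotation approach, and its mapping of claims to app quality features, is not reproduced here), candidate data-collection sentences could be flagged with simple patterns:

import re

# Hand-written patterns standing in for a learned claim detector.
COLLECTION_PATTERNS = [
    r"\bwe (?:collect|gather|store|record)\b",
    r"\b(?:location|contact|payment|device) (?:data|information)\b",
]

def find_claims(policy_text):
    # Return sentences that look like data-collection claims.
    sentences = re.split(r"(?<=[.!?])\s+", policy_text)
    return [s for s in sentences
            if any(re.search(p, s, re.IGNORECASE) for p in COLLECTION_PATTERNS)]

policy = ("We collect location data to match you with nearby drivers. "
          "You can delete your account at any time.")
for claim in find_claims(policy):
    print("claim:", claim)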
Book ChapterDOI
01 Jan 2022
TL;DR: In this paper , the authors proposed an auto classify approach to classify the defects into impact categories as defined by Orthogonal Defect Classification (ODC), which is a popular model for classifying defects and it provides an in-depth analysis of the defects.
Abstract: Software systems have become an integral part of all organizations. These systems perform many critical operations. A defect in these systems affects product quality and the software development process. Predicting the impact category of these defects helps in improving the defect management process as well as in taking correct decisions to fix defects. Orthogonal Defect Classification (ODC) is a popular model for classifying defects, and it provides an in-depth analysis of the defects. In this study, we propose an auto-classify approach to classify defects into the impact categories defined by ODC. Bag of words, term frequency-inverse document frequency, and word embedding have been used to represent the textual data as numeric vectors. For the experimental work, we used 4,096 reports from three NoSQL databases. We trained and tested the proposed auto-classify approach using a Support Vector Machine (SVM) and a Random Forest Classifier (RFC), achieving maximum accuracies of 94% and 85.99%, respectively. Keywords: Orthogonal defect classification; Bag of words; Term frequency-inverse document frequency; Word embedding
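A sketch of the classification setup named above, with TF-IDF features feeding an SVM and a Random Forest; the defect reports and ODC impact labels below are invented stand-ins for the study's 4,096 real reports:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

reports = ["query returns stale rows after compaction",
           "node crashes when replica count drops below quorum",
           "writes become slow under heavy concurrent load",
           "rows silently lost after unclean shutdown"]
impacts = ["integrity", "reliability", "performance", "integrity"]

# Same TF-IDF features, two classifiers, as in the study's comparison.
for clf in (SVC(kernel="linear"), RandomForestClassifier(random_state=0)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(reports, impacts)
    print(type(clf).__name__, model.predict(["reads hang during compaction"]))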
References
Journal Article
TL;DR: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems, focusing on bringing machine learning to non-specialists using a general-purpose high-level language.
Abstract: Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net.
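A minimal taste of the API consistency this abstract emphasizes; every scikit-learn estimator follows the same fit/predict convention:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a bundled dataset, split it, fit an estimator, score it.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))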

47,974 citations

Journal ArticleDOI
Jacob Cohen
TL;DR: In this article, the author presents a procedure for having two or more judges independently categorize a sample of units and for determining the degree and significance of their agreement, as a way of assessing the extent to which such judgments are reproducible, i.e., reliable.
Abstract: CONSIDER Table 1. It represents in its formal characteristics a situation which arises in the clinical-social-personality areas of psychology, where it frequently occurs that the only useful level of measurement obtainable is nominal scaling (Stevens, 1951, pp. 25-26), i.e. placement in a set of k unordered categories. Because the categorizing of the units is a consequence of some complex judgment process performed by a "two-legged meter" (Stevens, 1958), it becomes important to determine the extent to which these judgments are reproducible, i.e., reliable. The procedure which suggests itself is that of having two (or more) judges independently categorize a sample of units and determine the degree, significance, and
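The statistic this paper introduces is kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed proportion of agreement and p_e the agreement expected by chance from the judges' marginal totals; a toy two-judge example using scikit-learn's implementation (the category labels are invented):

from sklearn.metrics import cohen_kappa_score

# Two judges independently assign five units to k = 3 nominal categories.
judge_a = ["anxious", "depressed", "anxious", "other", "depressed"]
judge_b = ["anxious", "depressed", "other", "other", "depressed"]

# kappa corrects the raw agreement (here 4/5) for chance agreement.
print("kappa:", cohen_kappa_score(judge_a, judge_b))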

34,965 citations

Journal ArticleDOI
TL;DR: This work proposes a generative model for text and other collections of discrete data that generalizes or improves on several previous models including naive Bayes/unigram, mixture of unigrams, and Hofmann's aspect model.
Abstract: We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.
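A toy demonstration of the document-as-topic-mixture view described above, using gensim's variational implementation; the corpus and topic count are illustrative:

from gensim import corpora
from gensim.models import LdaModel

texts = [["game", "level", "score", "play"],
         ["doctor", "health", "symptom", "checkup"],
         ["score", "game", "health", "level"]]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

# Each document is represented as a finite mixture over the learned topics.
for doc_id, bow in enumerate(corpus):
    print(doc_id, lda.get_document_topics(bow))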

30,570 citations

Proceedings ArticleDOI
01 Oct 2014
TL;DR: A new global log-bilinear regression model combines the advantages of the two major model families in the literature, global matrix factorization and local context window methods, and produces a vector space with meaningful substructure.
Abstract: Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.
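The "meaningful substructure" claimed above can be probed with pre-trained GloVe vectors, here fetched through gensim's downloader, on the classic analogy task:

import gensim.downloader as api

# Pre-trained 100-dimensional GloVe vectors (downloaded on first use).
glove = api.load("glove-wiki-gigaword-100")

# king - man + woman should land near queen in the vector space.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))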

30,558 citations

Proceedings Article
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, Jeffrey Dean
05 Dec 2013
TL;DR: This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.
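Mapped onto gensim's implementation (a tooling assumption; the paper's original code is the word2vec C tool), the extensions above correspond to sg=1 for skip-gram, negative=5 for negative sampling, sample for frequent-word subsampling, and Phrases for the phrase-finding method:

from gensim.models import Word2Vec
from gensim.models.phrases import Phrases, Phraser

sentences = [["air", "canada", "flight", "was", "delayed"],
             ["air", "canada", "announced", "new", "routes"],
             ["fresh", "air", "across", "canada", "mountains"]]

# Phrase detection merges frequent collocations such as "air canada"
# into single tokens, following the paper's phrase-finding idea.
bigrams = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
phrased = [bigrams[s] for s in sentences]

# sg=1: skip-gram; negative=5: negative sampling; sample: subsampling.
model = Word2Vec(phrased, vector_size=50, sg=1, negative=5,
                 sample=1e-3, min_count=1, epochs=50, seed=0)
print("air_canada" in model.wv.key_to_index)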

24,012 citations