
Showing papers by "Paul Cook" published in 2018


Proceedings ArticleDOI
27 Aug 2018
TL;DR: This work proposes to develop a lightweight system that can generate signatures of malware writers by leveraging the string components present in their Android binaries, and can effectively detect a wide range of existing, as well as any new, malware samples generated by particular authors.
Abstract: With the rising popularity of Android mobile devices, the number of malicious applications targeting the Android platform has been increasing tremendously. To mitigate the risk of malicious apps, there is a need for an automated system to detect these applications. Current detection techniques rely on the signatures of well-documented malware, and hence may not be able to detect new malware samples. Instead of generating signatures for malware samples themselves, in this work, we propose to develop a lightweight system that can generate signatures of malware writers by leveraging the string components present in their Android binaries. Using these author signatures, we can effectively detect a wide range of existing, as well as any new, malware samples generated by particular authors. The proposed system achieved 98%, 96%, and 71% accuracy over datasets of 1559 benign, 262 malicious, and 96 obfuscated Android applications, respectively. The string-based approach achieved 71% accuracy, compared to only 50% obtained with Ding and Samadzadeh's existing system.

25 citations
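
The abstract above does not say how author signatures are built from string components, so the following is only a minimal sketch of one way such a scheme could work, not the paper's system. It assumes a signature is the set of printable strings shared by all known samples from an author, and that a new sample is scored by the fraction of that signature it contains. The function names, the 4-character minimum, and the scoring rule are all hypothetical.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> set[str]:
    """Return printable ASCII runs of at least min_len characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return {m.decode("ascii") for m in re.findall(pattern, data)}


def author_signature(samples: list[bytes]) -> set[str]:
    """Hypothetical signature: strings present in every sample by one author."""
    string_sets = [extract_strings(s) for s in samples]
    return set.intersection(*string_sets) if string_sets else set()


def match_score(sample: bytes, signature: set[str]) -> float:
    """Fraction of the signature's strings found in the sample (assumed rule)."""
    if not signature:
        return 0.0
    return len(extract_strings(sample) & signature) / len(signature)
```

A real system would work on strings pulled from the Android binaries themselves and would need a decision threshold tuned on labelled data; neither is specified in the abstract.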


Proceedings ArticleDOI
01 Jul 2018
TL;DR: Models for classifying VNC usages as idiomatic or literal are evaluated over a variety of approaches to forming distributed representations; a model based on averaging word embeddings performs on par with, or better than, a previously proposed approach based on skip-thoughts.
Abstract: Verb-noun combinations (VNCs) - e.g., blow the whistle, hit the roof, and see stars - are a common type of English idiom that are ambiguous with literal usages. In this paper we propose and evaluate models for classifying VNC usages as idiomatic or literal, based on a variety of approaches to forming distributed representations. Our results show that a model based on averaging word embeddings performs on par with, or better than, a previously-proposed approach based on skip-thoughts. Idiomatic usages of VNCs are known to exhibit lexico-syntactic fixedness. We further incorporate this information into our models, demonstrating that this rich linguistic knowledge is complementary to the information carried by distributed representations.

10 citations
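
The embedding-averaging representation mentioned in the abstract above can be illustrated with a small sketch. It assumes a plain dictionary of toy word vectors and skips out-of-vocabulary tokens; the function name, the toy vectors, and the out-of-vocabulary handling are assumptions for illustration, not details taken from the paper.

```python
def average_embedding(tokens, embeddings, dim):
    """Average the vectors of the tokens that have an embedding.

    tokens:     list of word strings from the sentence containing the VNC
    embeddings: dict mapping word -> list of floats of length dim (toy data)
    dim:        embedding dimensionality
    """
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: fall back to the zero vector (an assumed choice).
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

The resulting fixed-length vector would then be passed to a classifier that labels the usage as idiomatic or literal; the choice of classifier is not stated in the abstract.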


Proceedings Article
01 Aug 2018
TL;DR: Experimental results on two kinds of MWEs and two languages suggest that character-level neural network language models capture knowledge of multiword expression compositionality, in particular for English noun compounds and the particle component of English verb-particle constructions.
Abstract: In this paper, we propose the first model for multiword expression (MWE) compositionality prediction based on character-level neural network language models. Experimental results on two kinds of MWEs (noun compounds and verb-particle constructions) and two languages (English and German) suggest that character-level neural network language models capture knowledge of multiword expression compositionality, in particular for English noun compounds and the particle component of English verb-particle constructions. In contrast to many other approaches to MWE compositionality prediction, this character-level approach does not require token-level identification of MWEs in a training corpus, and can potentially predict the compositionality of out-of-vocabulary MWEs.

5 citations


Proceedings Article
01 May 2018
TL;DR: This paper first constructs and analyzes a web corpus of Mi’kmaq, then evaluates several approaches to language modelling for Mi’kmaq, including character-level models that are particularly well-suited to morphologically-rich languages.
Abstract: Mi’kmaq is a polysynthetic Indigenous language spoken primarily in Eastern Canada, on which no prior computational work has focused. In this paper we first construct and analyze a web corpus of Mi’kmaq. We then evaluate several approaches to language modelling for Mi’kmaq, including character-level models that are particularly well-suited to morphologically-rich languages. Preservation of Indigenous languages is particularly important in the current Canadian context; we argue that natural language processing could aid such efforts.

4 citations


Proceedings ArticleDOI
01 Jun 2018
TL;DR: Three unsupervised models for capturing discriminative attributes, based on information from word embeddings, WordNet, and sentence-level word co-occurrence frequency, are presented; the simple approach based on word co-occurrence performs best.
Abstract: In this paper we present three unsupervised models for capturing discriminative attributes based on information from word embeddings, WordNet, and sentence-level word co-occurrence frequency. We show that, of these approaches, the simple approach based on word co-occurrence performs best. We further consider supervised and unsupervised approaches to combining information from these models, but these approaches do not improve on the word co-occurrence model.

3 citations
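
As a rough illustration of the sentence-level co-occurrence idea in the abstract above, the sketch below counts how many sentences contain both a word and a candidate attribute. It assumes the task is framed as deciding whether an attribute applies to one word more than to another, and it uses a simple "higher count wins" rule; that framing, the decision rule, and all names are assumptions rather than the paper's actual model.

```python
def cooccurrence_count(sentences, word, attribute):
    """Number of sentences that contain both the word and the attribute.

    sentences: list of sets of lowercase tokens, one set per sentence
    """
    return sum(1 for s in sentences if word in s and attribute in s)


def is_discriminative(sentences, word1, word2, attribute):
    """Assumed rule: the attribute co-occurs more often with word1 than word2."""
    return (cooccurrence_count(sentences, word1, attribute)
            > cooccurrence_count(sentences, word2, attribute))
```

With a large corpus, raw counts would usually be normalised by word frequency; the abstract does not say whether or how this was done.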


Book ChapterDOI
08 May 2018
TL;DR: This work proposes an author verification task in the realm of blog posts to detect and block unauthorized users based on the textual content of their posts, representing documents with methods such as word frequency and word2vec.
Abstract: Although social media platforms can assist organizations’ progress, they also make them vulnerable to unauthorized users gaining access to their account and posting as the organization. This can have negative effects on the company’s public appearance and profit. Once attackers gain access to a social media account, they are able to post any content from that account. In this paper, we propose an author verification task in the realm of blog posts to detect and block unauthorized users based on the textual content of their unauthorized post. We use different methods to represent a document, such as word frequency and word2vec, and we train two different classifiers over these document representations. The experimental results show that, regardless of the classifier, the word2vec method outperforms the other representations.

3 citations
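
The word-frequency document representation named in the abstract above might look something like the following sketch. It assumes whitespace tokenisation, lowercasing, and relative frequencies over a fixed vocabulary; these choices and the function name are illustrative assumptions, since the abstract does not give the details.

```python
from collections import Counter


def word_frequency_vector(text, vocabulary):
    """Represent a post as relative word frequencies over a fixed vocabulary.

    text:       raw post text
    vocabulary: ordered list of lowercase words defining the vector positions
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[w] / total for w in vocabulary]
```

Vectors like these, built from an account's known posts, could then be used to train a classifier that flags posts unlikely to come from the legitimate author.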