Proceedings ArticleDOI

URL classification using non negative matrix factorization

TL;DR: The objective of this paper is to proactively classify anomalous accesses, enabling campus ISPs to deny access to users misusing the Internet.
Abstract: Internet availability on a campus is not metered. Internet link bandwidths are vulnerable as they can be misused. Moreover, websites blacklist campuses for misuse. In particular, blacklisting by academic websites like IEEE and ACM can lead to serious researchers being denied access to information. The objective of this paper is to proactively classify anomalous accesses. This will enable campus ISPs to deny access to users misusing the Internet. Specifically, URLs are classified using the short snippets (meta-data) that are available. New features, namely random walk term weights and within-class popularity, in tandem with non-negative matrix factorization, show a lot of promise for classifying URLs. The classification accuracy is as high as 92.96% on 10 gigabytes of proxy data.
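The pipeline below is a minimal sketch of the kind of approach the abstract describes, with plain TF-IDF weights standing in for the paper's random walk term weights and within-class popularity features (not reproduced here); the snippets, labels, and component count are illustrative placeholders.

```python
# Hedged sketch: TF-IDF term weights -> NMF dimension reduction -> classifier.
# Plain TF-IDF is an assumption standing in for the paper's custom features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = [
    "ieee xplore digital library research paper download",
    "free movie streaming watch online hd",
    "acm digital library proceedings conference",
    "torrent download latest movies games",
]
labels = ["academic", "entertainment", "academic", "entertainment"]

clf = make_pipeline(
    TfidfVectorizer(),                    # non-negative term-weight matrix
    NMF(n_components=2, init="nndsvda"),  # reduce to k latent components
    LogisticRegression(),                 # classify in the reduced space
)
clf.fit(snippets, labels)
print(clf.predict(["download conference paper pdf"]))
```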
Citations
Journal ArticleDOI
TL;DR: The results showed that the proposed methods significantly improved the efficiency and performance of certain classifiers, such as k-Nearest Neighbor, Support Vector Machine, and neural networks.

85 citations

Journal ArticleDOI
TL;DR: This work has focussed on anomaly-based intrusion detection in the campus environment at the network edge using machine learning and time series analysis applied at different layers in TCP/IP stack.
Abstract: The Internet has become a vital source of information; internal and external attacks threaten the integrity of the LAN connected to the Internet. In this work, several techniques have been described for detection of such threats. We have focussed on anomaly-based intrusion detection in the campus environment at the network edge. A campus LAN consisting of more than 9000 users with a 90 Mbps internet access link is a large network. Therefore, efficient techniques are required to handle such big data and to model user behaviour. Proxy server logs of a campus LAN and edge router traces have been used for anomalies like abusive Internet access, systematic downloading (internal threats) and DDoS attacks (external threat); our techniques involve machine learning and time series analysis applied at different layers in TCP/IP stack. Accuracy of our techniques has been demonstrated through extensive experimentation on huge and varied datasets. All the techniques are applicable at the edge and can be integrated into a Network Intrusion Detection System.

12 citations


Cites methods from "URL classification using non negati..."

  • ...In this study, meta-data embedded in a HTML page has been used to perform web-page classification (Khare et al 2014)....


Posted Content
TL;DR: This work extends the benchmark datasets with naturally occurring corruptions such as Spelling Errors, Text Noise and Synonyms, makes them publicly available, and finds that targeted corruptions can expose vulnerabilities of a model better than random choices in most cases.
Abstract: Text classification models, especially neural network based models, have reached very high accuracy on many popular benchmark datasets. Yet such models, when deployed in real world applications, tend to perform badly. The primary reason is that these models are not tested against sufficient real world natural data. Depending on the application's users, the vocabulary and the style of the model's input may vary greatly. This emphasizes the need for a model agnostic test dataset, which consists of various corruptions that naturally appear in the wild. Models trained and tested on such benchmark datasets will be more robust against real world data. However, such datasets are not easily available. In this work, we address this problem by extending the benchmark datasets with naturally occurring corruptions such as Spelling Errors, Text Noise and Synonyms, and making them publicly available. Through extensive experiments, we compare random and targeted corruption strategies using Local Interpretable Model-Agnostic Explanations (LIME). We report the vulnerabilities of two popular text classification models along these corruptions and also find that targeted corruptions can expose vulnerabilities of a model better than random choices in most cases.
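As a rough illustration of the spelling-error style of corruption described above (the work's exact corruption procedure and rates are not reproduced here), one could inject character transpositions into a fraction of the words:

```python
# Hypothetical corruption helper: transpose two adjacent characters in
# roughly `rate` of the words. The real benchmark corruptions may differ.
import random

def corrupt_spelling(text: str, rate: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)

print(corrupt_spelling("non negative matrix factorization for url classification"))
```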

1 citation


Cites methods from "URL classification using non negati..."

  • ...In the past, heavy feature engineering was applied to produce an acceptable text based model (Aggarwal and Zhai, 2012; Yang and Pedersen, 1997; Khare et al., 2014)....


References
Journal ArticleDOI
TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.
Abstract: Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
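A small, hedged illustration of using WordNet under program control, here through NLTK's interface: synsets group words into lexicalized concepts, and morphy-based lemmatization maps inflected forms to a root word, which is the role the citing paper attributes to WordNet for stemming. It assumes the NLTK wordnet corpus can be downloaded.

```python
# Look up synonym sets and lemmatize words via WordNet (NLTK interface).
import nltk
nltk.download("wordnet", quiet=True)  # assumes the corpus is obtainable
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

print([s.name() for s in wordnet.synsets("classification")[:3]])  # synonym sets

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("accesses"))          # noun form -> 'access'
print(lemmatizer.lemmatize("running", pos="v"))  # verb form -> 'run'
```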

15,068 citations


"URL classification using non negati..." refers methods in this paper

  • ...Stemming was performed using Word Net [13] which uses heuristics based approaches like Porter’s stemming [14] and combines words with same root....


Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
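A compact sketch of the latent semantic indexing idea described above: factor a term-document matrix with a truncated SVD, then rank documents against a pseudo-document query by cosine similarity in the reduced space. The corpus, query, and factor count (2 rather than ca. 100) are illustrative.

```python
# Truncated SVD of a TF-IDF term-document matrix, with cosine-ranked retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "singular value decomposition of a term document matrix",
    "campus proxy logs record every url accessed by users",
    "matrix factorization recovers latent semantic structure",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)              # term-by-document weights
svd = TruncatedSVD(n_components=2)       # illustrative factor count
doc_vecs = svd.fit_transform(X)

# A query becomes a pseudo-document vector in the same factor space.
query_vec = svd.transform(vec.transform(["latent semantic matrix"]))
print(cosine_similarity(query_vec, doc_vecs))  # documents above a threshold are returned
```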

12,443 citations


"URL classification using non negati..." refers background in this paper

  • ...Ideally, k is chosen large enough to fit all the real structure in the data but small enough so that unimportant details are removed [20]....


  • ...A dimension reduction is therefore required for meaningful classification [20]....


Journal ArticleDOI
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
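A small worked example of single-term weighting of the kind this reference surveys, here the familiar tf-idf scheme computed directly with NumPy so the weights are visible; the counts are illustrative.

```python
# tf-idf term weighting over a toy term-frequency matrix.
import numpy as np

# Term frequencies for 3 documents over 4 terms (rows = documents).
tf = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)     # document frequency of each term
idf = np.log(n_docs / df)     # rarer terms receive larger weights
weights = tf * idf            # tf-idf weight of term j in document i
print(weights)
```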

9,460 citations


"URL classification using non negati..." refers methods in this paper

  • ...Using this model, a document is represented as a vector, whose components are the weight that we assign to each term in a document [16]....


Journal ArticleDOI
TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Abstract: The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
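Suffix stripping of this kind can be tried directly: NLTK ships a Porter stemmer implementing the algorithm described above (NLTK's version adds some extensions to the original rules), and the example words below are illustrative.

```python
# Stepwise suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["accesses", "factorization", "classification", "relational"]:
    print(word, "->", stemmer.stem(word))
# e.g. accesses -> access, factorization -> factor
```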

7,572 citations

Proceedings Article
01 Jan 2000
TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed and one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence.
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
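A minimal sketch of the least-squares multiplicative updates described above: both factors stay non-negative because each update multiplies by a non-negative ratio. The matrix size, rank, and iteration count are illustrative, and a small epsilon is added to avoid division by zero.

```python
# Multiplicative-update NMF (least-squares form): V ~ W @ H with W, H >= 0.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))          # non-negative data matrix (e.g. term-document)
k = 2                           # number of latent components
W = rng.random((6, k))
H = rng.random((k, 4))

eps = 1e-9                      # numerical guard against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update rule for H
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update rule for W

print(np.linalg.norm(V - W @ H))  # reconstruction error, non-increasing under these updates
```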

7,345 citations


"URL classification using non negati..." refers background in this paper

  • ...It forms additive parts-based representation of the data [22]....
