Proceedings ArticleDOI

URL classification using non negative matrix factorization

TL;DR: The objective of this paper is to proactively classify anomalous accesses, enabling campus ISPs to deny access to users misusing the Internet.
Abstract: Internet availability on a campus is not metered. Internet link bandwidths are vulnerable as they can be misused. Moreover, websites blacklist campuses for misuse. In particular, blacklisting by academic websites like IEEE and ACM can lead to serious researchers being denied access to information. The objective of this paper is to proactively classify anomalous accesses. This will enable campus ISPs to deny access to users misusing the Internet. Specifically, URLs are classified using the short snippets (meta-data) that are available. New features, namely random walk term weights and within-class popularity, in tandem with non-negative matrix factorization, show a lot of promise for classifying URLs. The classification accuracy is as high as 92.96% on 10 gigabytes of proxy data.
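The pipeline below is a minimal sketch of the kind of approach the abstract describes, with plain TF-IDF weights standing in for the paper's random walk term weights and within-class popularity features (not reproduced here); the snippets, labels, and component count are illustrative placeholders.

```python
# Hedged sketch: TF-IDF term weights -> NMF dimension reduction -> classifier.
# Plain TF-IDF is an assumption standing in for the paper's custom features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

snippets = [
    "ieee xplore digital library research paper download",
    "free movie streaming watch online hd",
    "acm digital library proceedings conference",
    "torrent download latest movies games",
]
labels = ["academic", "entertainment", "academic", "entertainment"]

clf = make_pipeline(
    TfidfVectorizer(),                    # non-negative term-weight matrix
    NMF(n_components=2, init="nndsvda"),  # reduce to k latent components
    LogisticRegression(),                 # classify in the reduced space
)
clf.fit(snippets, labels)
print(clf.predict(["download conference paper pdf"]))
```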
Citations
Journal ArticleDOI
TL;DR: The results showed that the proposed methods significantly improved the efficiency and performance of certain classifiers, such as k-Nearest Neighbor, Support Vector Machine, and neural networks.

85 citations

Journal ArticleDOI
TL;DR: This work has focussed on anomaly-based intrusion detection in the campus environment at the network edge using machine learning and time series analysis applied at different layers in TCP/IP stack.
Abstract: The Internet has become a vital source of information; internal and external attacks threaten the integrity of the LAN connected to the Internet. In this work, several techniques have been described for detection of such threats. We have focussed on anomaly-based intrusion detection in the campus environment at the network edge. A campus LAN consisting of more than 9000 users with a 90 Mbps internet access link is a large network. Therefore, efficient techniques are required to handle such big data and to model user behaviour. Proxy server logs of a campus LAN and edge router traces have been used for anomalies like abusive Internet access, systematic downloading (internal threats) and DDoS attacks (external threat); our techniques involve machine learning and time series analysis applied at different layers in TCP/IP stack. Accuracy of our techniques has been demonstrated through extensive experimentation on huge and varied datasets. All the techniques are applicable at the edge and can be integrated into a Network Intrusion Detection System.

12 citations


Cites methods from "URL classification using non negati..."

  • ...In this study, meta-data embedded in a HTML page has been used to perform web-page classification (Khare et al 2014)....


Posted Content
TL;DR: This work extends the benchmark datasets with naturally occurring corruptions such as Spelling Errors, Text Noise and Synonyms, makes them publicly available, and finds that targeted corruptions can expose vulnerabilities of a model better than random choices in most cases.
Abstract: Text classification models, especially neural network based models, have reached very high accuracy on many popular benchmark datasets. Yet such models, when deployed in real world applications, tend to perform badly. The primary reason is that these models are not tested against sufficient real world natural data. Depending on the application's users, the vocabulary and the style of the model's input may vary greatly. This emphasizes the need for a model agnostic test dataset, which consists of various corruptions that naturally appear in the wild. Models trained and tested on such benchmark datasets will be more robust against real world data. However, such datasets are not easily available. In this work, we address this problem by extending the benchmark datasets with naturally occurring corruptions such as Spelling Errors, Text Noise and Synonyms, and making them publicly available. Through extensive experiments, we compare random and targeted corruption strategies using Local Interpretable Model-Agnostic Explanations (LIME). We report the vulnerabilities of two popular text classification models along these corruptions and also find that targeted corruptions can expose vulnerabilities of a model better than random choices in most cases.
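As a rough illustration of the spelling-error style of corruption described above (the work's exact corruption procedure and rates are not reproduced here), one could inject character transpositions into a fraction of the words:

```python
# Hypothetical corruption helper: transpose two adjacent characters in
# roughly `rate` of the words. The real benchmark corruptions may differ.
import random

def corrupt_spelling(text: str, rate: float = 0.3, seed: int = 0) -> str:
    rng = random.Random(seed)
    words = []
    for word in text.split():
        if len(word) > 3 and rng.random() < rate:
            i = rng.randrange(len(word) - 1)
            word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
        words.append(word)
    return " ".join(words)

print(corrupt_spelling("non negative matrix factorization for url classification"))
```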

1 citation


Cites methods from "URL classification using non negati..."

  • ...In the past, heavy feature engineering was applied to produce an acceptable text based model (Aggarwal and Zhai, 2012; Yang and Pedersen, 1997; Khare et al., 2014)....


References
Journal ArticleDOI
TL;DR: WordNet provides a more effective combination of traditional lexicographic information and modern computing, and is an online lexical database designed for use under program control.
Abstract: Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings. This information is traditionally provided through dictionaries, and machine-readable dictionaries are now widely available. But dictionary entries evolved for the convenience of human readers, not for machines. WordNet provides a more effective combination of traditional lexicographic information and modern computing. WordNet is an online lexical database designed for use under program control. English nouns, verbs, adjectives, and adverbs are organized into sets of synonyms, each representing a lexicalized concept. Semantic relations link the synonym sets [4].
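A small, hedged illustration of using WordNet under program control, here through NLTK's interface: synsets group words into lexicalized concepts, and morphy-based lemmatization maps inflected forms to a root word, which is the role the citing paper attributes to WordNet for stemming. It assumes the NLTK wordnet corpus can be downloaded.

```python
# Look up synonym sets and lemmatize words via WordNet (NLTK interface).
import nltk
nltk.download("wordnet", quiet=True)  # assumes the corpus is obtainable
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

print([s.name() for s in wordnet.synsets("classification")[:3]])  # synonym sets

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("accesses"))          # noun form -> 'access'
print(lemmatizer.lemmatize("running", pos="v"))  # verb form -> 'run'
```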

15,068 citations


"URL classification using non negati..." refers methods in this paper

  • ...Stemming was performed using Word Net [13] which uses heuristics based approaches like Porter’s stemming [14] and combines words with same root....


Journal ArticleDOI
TL;DR: A new method for automatic indexing and retrieval to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries.
Abstract: A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term by document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100 item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
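A compact sketch of the latent semantic indexing idea described above: factor a term-document matrix with a truncated SVD, then rank documents against a pseudo-document query by cosine similarity in the reduced space. The corpus, query, and factor count (2 rather than ca. 100) are illustrative.

```python
# Truncated SVD of a TF-IDF term-document matrix, with cosine-ranked retrieval.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "singular value decomposition of a term document matrix",
    "campus proxy logs record every url accessed by users",
    "matrix factorization recovers latent semantic structure",
]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)              # term-by-document weights
svd = TruncatedSVD(n_components=2)       # illustrative factor count
doc_vecs = svd.fit_transform(X)

# A query becomes a pseudo-document vector in the same factor space.
query_vec = svd.transform(vec.transform(["latent semantic matrix"]))
print(cosine_similarity(query_vec, doc_vecs))  # documents above a threshold are returned
```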

12,443 citations


"URL classification using non negati..." refers background in this paper

  • ...Ideally, k is chosen large enough to fit all the real structure in the data but small enough so that unimportant details are removed [20]....


  • ...A dimension reduction is therefore required for meaningful classification [20]....


Journal ArticleDOI
TL;DR: This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
Abstract: The experimental evidence accumulated over the past 20 years indicates that text indexing systems based on the assignment of appropriately weighted single terms produce retrieval results that are superior to those obtainable with other more elaborate text representations. These results depend crucially on the choice of effective term weighting systems. This paper summarizes the insights gained in automatic term weighting, and provides baseline single term indexing models with which other more elaborate content analysis procedures can be compared.
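A small worked example of single-term weighting of the kind this reference surveys, here the familiar tf-idf scheme computed directly with NumPy so the weights are visible; the counts are illustrative.

```python
# tf-idf term weighting over a toy term-frequency matrix.
import numpy as np

# Term frequencies for 3 documents over 4 terms (rows = documents).
tf = np.array([
    [2, 0, 1, 0],
    [0, 3, 0, 1],
    [1, 1, 0, 0],
], dtype=float)

n_docs = tf.shape[0]
df = (tf > 0).sum(axis=0)     # document frequency of each term
idf = np.log(n_docs / df)     # rarer terms receive larger weights
weights = tf * idf            # tf-idf weight of term j in document i
print(weights)
```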

9,460 citations


"URL classification using non negati..." refers methods in this paper

  • ...Using this model, a document is represented as a vector, whose components are the weight that we assign to each term in a document [16]....


Journal ArticleDOI
TL;DR: An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL, and performs slightly better than a much more elaborate system with which it has been compared.
Abstract: The automatic removal of suffixes from words in English is of particular interest in the field of information retrieval. An algorithm for suffix stripping is described, which has been implemented as a short, fast program in BCPL. Although simple, it performs slightly better than a much more elaborate system with which it has been compared. It effectively works by treating complex suffixes as compounds made up of simple suffixes, and removing the simple suffixes in a number of steps. In each step the removal of the suffix is made to depend upon the form of the remaining stem, which usually involves a measure of its syllable length.
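Suffix stripping of this kind can be tried directly: NLTK ships a Porter stemmer implementing the algorithm described above (NLTK's version adds some extensions to the original rules), and the example words below are illustrative.

```python
# Stepwise suffix stripping with NLTK's Porter stemmer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["accesses", "factorization", "classification", "relational"]:
    print(word, "->", stemmer.stem(word))
# e.g. accesses -> access, factorization -> factor
```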

7,572 citations

Proceedings Article
01 Jan 2000
TL;DR: Two different multiplicative algorithms for non-negative matrix factorization are analyzed and one algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence.
Abstract: Non-negative matrix factorization (NMF) has previously been shown to be a useful decomposition for multivariate data. Two different multiplicative algorithms for NMF are analyzed. They differ only slightly in the multiplicative factor used in the update rules. One algorithm can be shown to minimize the conventional least squares error while the other minimizes the generalized Kullback-Leibler divergence. The monotonic convergence of both algorithms can be proven using an auxiliary function analogous to that used for proving convergence of the Expectation-Maximization algorithm. The algorithms can also be interpreted as diagonally rescaled gradient descent, where the rescaling factor is optimally chosen to ensure convergence.
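A minimal sketch of the least-squares multiplicative updates described above: both factors stay non-negative because each update multiplies by a non-negative ratio. The matrix size, rank, and iteration count are illustrative, and a small epsilon is added to avoid division by zero.

```python
# Multiplicative-update NMF (least-squares form): V ~ W @ H with W, H >= 0.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((6, 4))          # non-negative data matrix (e.g. term-document)
k = 2                           # number of latent components
W = rng.random((6, k))
H = rng.random((k, 4))

eps = 1e-9                      # numerical guard against division by zero
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)   # update rule for H
    W *= (V @ H.T) / (W @ H @ H.T + eps)   # update rule for W

print(np.linalg.norm(V - W @ H))  # reconstruction error, non-increasing under these updates
```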

7,345 citations


"URL classification using non negati..." refers background in this paper

  • ...It forms additive parts-based representation of the data [22]....
