Author

R. Rajalakshmi

Other affiliations: Valliammai Engineering College
Bio: R. Rajalakshmi is an academic researcher from VIT University. The author has contributed to research in topics: Web page & Feature (machine learning). The author has an h-index of 5 and has co-authored 11 publications receiving 76 citations. Previous affiliations of R. Rajalakshmi include Valliammai Engineering College.

Papers
Journal Article
01 Feb 2018
TL;DR: A new feature weighting method for a Naive Bayes classifier is proposed that embeds the term goodness obtained from the feature selection method, along with a rejection framework that uses posterior probability to determine the confidence score.
Abstract: Web page classification has become a challenging task due to the exponential growth of the World Wide Web. Uniform Resource Locator (URL)-based web page classification systems play an important role, but high accuracy may not be achievable because a URL contains minimal information. Nevertheless, URL-based classifiers combined with a rejection framework can be used as a first-level filter in a multistage classifier, and costlier feature extraction from page contents can be deferred to later stages. However, noisy and irrelevant features present in URLs demand feature selection methods for URL classification. Therefore, we propose a supervised feature selection method by which relevant URL features are identified using statistical methods. We propose a new feature weighting method for a Naive Bayes classifier by embedding the term goodness obtained from the feature selection method. We also propose a rejection framework for the Naive Bayes classifier that uses posterior probability to determine the confidence score. The proposed method is evaluated on the Open Directory Project and WebKB data sets. Experimental results show that our method can be an effective first-level filter. McNemar tests confirm that our approach significantly improves the performance.
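The combination of feature weighting and rejection described in this abstract can be illustrated with a minimal sketch. It assumes a multinomial Naive Bayes over URL tokens with add-one smoothing, where each token's log-likelihood is scaled by a per-term weight and a prediction is rejected when the top posterior falls below a threshold. The function names, the example weights, and the 0.8 threshold are hypothetical; the abstract does not specify how term goodness is computed, so this is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns (priors, per-class token counts, vocabulary)."""
    class_counts = Counter(label for _, label in docs)
    token_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, token_counts, vocab

def posterior(tokens, model, weights):
    """Class posteriors; each token's log-likelihood is scaled by weights[token] (default 1.0)."""
    priors, token_counts, vocab = model
    log_scores = {}
    for c, prior in priors.items():
        total = sum(token_counts[c].values())
        score = math.log(prior)
        for t in tokens:
            if t not in vocab:
                continue  # tokens never seen in training are ignored
            likelihood = (token_counts[c][t] + 1) / (total + len(vocab))  # add-one smoothing
            score += weights.get(t, 1.0) * math.log(likelihood)
        log_scores[c] = score
    # Normalise in a numerically stable way (log-sum-exp).
    m = max(log_scores.values())
    exp_scores = {c: math.exp(s - m) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}

def classify_with_reject(tokens, model, weights, threshold=0.8):
    """Return (label, confidence); label is None when the top posterior is below threshold."""
    post = posterior(tokens, model, weights)
    label, conf = max(post.items(), key=lambda kv: kv[1])
    return (label if conf >= threshold else None), conf
```

A rejected URL (label `None`) would be passed on to a later, costlier content-based stage, which is the first-level-filter role the abstract describes.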

27 citations

Journal Article
01 May 2021
TL;DR: In the proposed CBRNN model, the CNN layer extracts a rich set of phrase-level features and the BGRU captures chronological features through long-term dependencies in a multi-layered sentence; the model outperforms the state of the art by 2%-4% on the IMDB and Polarity datasets.
Abstract: Sentiment analysis is the process of extracting the opinions of customers from online reviews. Because customers generally express their reviews in natural language, applying sentiment analysis to those reviews is a complex task. Earlier work applied word-level features with various feature weighting methods such as Bag of Words, TF-IDF, and Word2Vec, and deep learning networks were not explored much. We consider phrase-level and sentence-level features instead of word-level features for sentiment analysis, and enhance them by applying various deep learning techniques. In this article, we propose a hybrid convolutional bidirectional recurrent neural network model (CBRNN) that combines a two-layer convolutional neural network (CNN) with a bidirectional gated recurrent unit (BGRU). In the proposed CBRNN model, the CNN layer extracts a rich set of phrase-level features and the BGRU captures chronological features through long-term dependencies in a multi-layered sentence. The proposed approach was evaluated on two benchmark datasets and compared with various baselines. The experimental results show that the proposed hybrid model provides better results than the other models, with F1 scores of 87.62% and 77.4% on the IMDB and Polarity datasets, respectively. Our CBRNN model outperforms the state of the art by 2%-4% on these two datasets. We also observe that training takes slightly longer than with the existing approaches, in exchange for a substantial improvement in performance.

19 citations

Book Chapter
19 Sep 2018
TL;DR: This work proposes a transfer learning technique that combines the best-performing Convolutional Neural Network with machine learning algorithms such as the Naive Bayes classifier for detection and classification of DGA-generated domains.
Abstract: Malware domains generated by Domain Generation Algorithms (DGA) are highly dynamic in nature. The traditional approach of blacklisting malicious domains is time consuming and not effective, because the DGA randomly generates the domain names for the malware. For real-time applications, malware detection has to be performed on the fly, and hence sophisticated techniques are in demand to address this issue. Even though various machine learning techniques are employed for this purpose, the performance of such algorithms depends on how well the features are designed. In this work, we propose a transfer learning technique that combines the best-performing Convolutional Neural Network with machine learning algorithms such as the Naive Bayes classifier for detection and classification of DGA-generated domains. We evaluated our approach using the dataset released by the DMD 2018 Shared Task for both binary and multiclass classification scenarios. Our methodology of CNN with NB for binary classification was awarded first rank in the DMD 2018 shared task.

18 citations

Book Chapter
01 Jan 2019
TL;DR: This work concludes that Word2Vec with SGD is the best combination for the sentiment classification problem on the IMDB dataset and can be used as a base for future exploration of opinion value in any textual data.
Abstract: Sentiment analysis is a method of extracting subjective information from customer reviews. The analysis helps to reveal consumer insights about a product, a theme, or a service. In the existing literature, various methods such as BoW and TF-IDF are employed for sentiment analysis, and deep learning methods are not explored much. We applied the Word2Vec feature weighting method to this problem. We carried out various sentiment analysis experiments on IMDB, a large dataset that contains movie reviews. We compared various feature weighting methods using different classifiers and determined the best combination. From the experimental results, we conclude that Word2Vec with SGD is the best combination for the sentiment classification problem on the IMDB dataset. The results shown in the paper can be used as a base for future exploration of opinion value in any textual data.

16 citations

Proceedings Article
01 Oct 2018
TL;DR: An automated way of learning a category-specific universal dictionary of discriminating URL features is proposed. Using this automatically learnt dictionary, the feature vector dimensionality is made independent of the training set, which overcomes the difficulty of handling large-scale data.
Abstract: The ever-growing World Wide Web results in a large volume of web pages on a variety of topics. Many applications, such as information filtering and focused crawling, demand large-scale topic classification of web pages. To classify web pages, a URL-based approach is proposed, which avoids downloading the contents of the web page for classification purposes. In this paper, an automated way of learning a category-specific universal dictionary of discriminating URL features is proposed. Using this automatically learnt dictionary, the feature vector dimensionality is made independent of the training set, which overcomes the difficulty of handling large-scale data. For constructing this dictionary, the publicly available ODP dataset has been used. The proposed approach was evaluated by applying the automatically learnt URL feature dictionaries to another dataset that contains search results from Google. Experiments show that macro-average precision, recall and F1 values of 0.93, 0.85 and 0.88 have been achieved. We have observed that the difference is not statistically significant when the universal dictionary is applied instead of a dataset-specific term dictionary.
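The macro-averaged metrics reported in this abstract are computed per category and then averaged with equal weight per category. A minimal sketch of that computation follows, assuming single-label predictions; the function name `macro_scores` and the zero-division convention (score 0.0) are choices made here for illustration, not details taken from the paper.

```python
def macro_scores(y_true, y_pred):
    """Return (macro precision, macro recall, macro F1) over all labels seen."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Each class contributes equally, regardless of how many samples it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Note that macro F1 here is the mean of per-class F1 values, so it need not equal the harmonic mean of macro precision and macro recall.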

14 citations


Cited by
Posted Content
TL;DR: This article presents the formal formulation of Malicious URL Detection as a machine learning task, and categorizes and reviews the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.).
Abstract: Malicious URL, a.k.a. malicious website, is a common and serious threat to cybersecurity. Malicious URLs host unsolicited content (spam, phishing, drive-by exploits, etc.) and lure unsuspecting users to become victims of scams (monetary loss, theft of private information, and malware installation), and cause losses of billions of dollars every year. It is imperative to detect and act on such threats in a timely manner. Traditionally, this detection is done mostly through the usage of blacklists. However, blacklists cannot be exhaustive, and lack the ability to detect newly generated malicious URLs. To improve the generality of malicious URL detectors, machine learning techniques have been explored with increasing attention in recent years. This article aims to provide a comprehensive survey and a structural understanding of Malicious URL Detection techniques using machine learning. We present the formal formulation of Malicious URL Detection as a machine learning task, and categorize and review the contributions of literature studies that address different dimensions of this problem (feature representation, algorithm design, etc.). Further, this article provides a timely and comprehensive survey for a range of different audiences, not only for machine learning researchers and engineers in academia, but also for professionals and practitioners in the cybersecurity industry, to help them understand the state of the art and facilitate their own research and practical applications. We also discuss practical issues in system design and open research challenges, and point out some important directions for future research.

200 citations

Journal Article
TL;DR: An approach is proposed that uses trust prediction and a confusion matrix to rank web services by throughput and response time; correctly predicting trusted and untrusted users in web service invocation improves the overall selection process in a pool of similar web services.
Abstract: Accurately ranking web services can be very challenging depending on the evaluation criteria used; however, it can play an important role in making a better selection of web services afterward. This paper proposes an approach that uses trust prediction and a confusion matrix to rank web services by throughput and response time. AdaBoostM1 and J48 classifiers are used as binary classifiers on a benchmark web services dataset. A trust score (TS) measuring method is proposed that uses the confusion matrix to determine trust scores of all web services. Trust prediction is calculated using 5-fold, 10-fold, and 15-fold cross-validation methods. The reported results showed that web service 1 (WS1) was most trusted by users, with a TS value of 48.5294%, and web service 2 (WS2) was least trusted, with a TS value of 24.0196%. Correct prediction of trusted and untrusted users in web service invocation has improved the overall selection process in a pool of similar web services. Kappa statistic values are used to evaluate the proposed approach and to compare the performance of the two above-mentioned classifiers.
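The Kappa statistic mentioned in this abstract can be derived directly from a classifier's confusion matrix. The sketch below computes the standard Cohen's kappa; it does not reproduce the paper's trust score (TS), whose formula is not given here. The function name and the example matrix are hypothetical.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix (rows: actual, columns: predicted)."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(matrix[i][i] for i in range(n)) / total
    # Chance agreement: sum over classes of (row marginal * column marginal).
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix) for i in range(n)
    ) / (total * total)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)
```

For example, a matrix `[[40, 10], [5, 45]]` has observed agreement 0.85 and chance agreement 0.5, giving a kappa of 0.7.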

68 citations

Journal Article
TL;DR: This paper uses feature weighting and Laplace calibration to improve the naive Bayes classification algorithm.
Abstract: The naive Bayes classification algorithm is widely used in big data analysis and other fields because of its simple and fast structure. To address the shortcomings of the naive Bayes classification algorithm, this paper improves it using feature weighting and Laplace calibration. Numerical simulation shows that when the sample size is large, the accuracy of the improved naive Bayes classification algorithm is more than 99%, and it is very stable; when the number of sample attributes is less than 400 and the number of categories is less than 24, the accuracy of the improved algorithm is more than 95%. Empirical research shows that the improved naive Bayes classification algorithm can greatly improve the correct rate of discrimination analysis, from 49.5% to 92%. Robustness analysis shows that the improved naive Bayes classification algorithm has higher accuracy.
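One common reading of "Laplace calibration" is additive (Laplace) smoothing of the conditional probability estimates, which prevents a single unseen attribute value from forcing a class probability to zero. That reading is an assumption made here, and the sketch below shows only the standard smoothing formula, not the paper's exact procedure. The function name and parameters are hypothetical.

```python
from collections import Counter

def smoothed_likelihood(value, observed_values, n_distinct, alpha=1.0):
    """Estimate P(value | class) with additive smoothing.

    observed_values: attribute values seen for one class in training.
    n_distinct: number of distinct values the attribute can take.
    alpha: smoothing strength (alpha=1.0 is classic Laplace smoothing).
    """
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * n_distinct)
```

With observations `["red", "red", "blue"]` and three possible values, an unseen value such as `"green"` gets probability 1/6 instead of 0, and the three estimates still sum to 1.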

30 citations

Journal Article
TL;DR: This paper describes a movie recommendation system that uses Cosine Similarity to recommend movies similar to the one chosen by the user, and that performs sentiment analysis on the reviews of the chosen movie using machine learning.
Abstract: In the modern world, where technology is at the forefront of every industry, there has been an overload of information and data. A recommendation system helps deal with this large volume of data by quickly filtering out the information relevant to the user's choice. This paper describes an approach to a movie recommendation system using Cosine Similarity to recommend similar movies based on the one chosen by the user. Although existing recommendation systems get the job done, they do not indicate whether a movie is worth spending time on. To enhance the user experience, this system performs sentiment analysis on the reviews of the chosen movie using machine learning. Two supervised machine learning algorithms, the Naïve Bayes (NB) classifier and the Support Vector Machine (SVM) classifier, are used to increase accuracy and efficiency. This paper also compares NB and SVM on the basis of Accuracy, Precision, Recall and F1 Score. The accuracy of SVM was 98.63%, whereas the accuracy of NB was 97.33%. Thus, SVM outperforms NB and proves to be a better fit for sentiment analysis.
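The cosine-similarity step described in this abstract can be sketched as follows, assuming each movie has already been converted to a numeric feature vector (how the paper builds those vectors is not stated here). The function names, the catalog layout, and the example vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(chosen, catalog, k=2):
    """Return the k titles in catalog most similar to the chosen title."""
    target = catalog[chosen]
    scored = [
        (cosine_similarity(target, vec), title)
        for title, vec in catalog.items()
        if title != chosen
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored[:k]]
```

Because cosine similarity depends only on the direction of the vectors, two movies with proportional feature vectors score 1.0 regardless of magnitude.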

26 citations