
Showing papers by "Vipin Kumar published in 2005"


Proceedings ArticleDOI
21 Aug 2005
TL;DR: A novel feature bagging approach for detecting outliers in very large, high-dimensional and noisy databases is proposed; it combines the results of multiple outlier detection algorithms that are applied using different sets of features.
Abstract: Outlier detection has recently become an important problem in many industrial and financial applications. In this paper, a novel feature bagging approach for detecting outliers in very large, high-dimensional and noisy databases is proposed. It combines results from multiple outlier detection algorithms that are applied using different sets of features. Every outlier detection algorithm uses a small subset of features that are randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns to all data records outlier scores that correspond to their probability of being outliers. The outlier scores computed by the individual outlier detection algorithms are then combined in order to find better-quality outliers. Experiments performed on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.

622 citations
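
As an editorial illustration of the feature-bagging approach described in the abstract above, the following sketch runs a simple k-nearest-neighbour distance detector on several random feature subsets and averages the resulting outlier scores. The base detector and the averaging rule are assumptions made here for illustration; the paper studies its own combination functions.

```python
# Minimal sketch of the feature-bagging idea: run a simple distance-based
# outlier detector on several random feature subsets and combine the scores.
# The base detector (k-NN distance) and score averaging are illustrative
# assumptions, not the exact combination functions studied in the paper.
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each record by its distance to its k-th nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    d.sort(axis=1)
    return d[:, k]  # column 0 is the distance of each record to itself (zero)

def feature_bagging_scores(X, n_rounds=10, k=5, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    scores = np.zeros(n)
    for _ in range(n_rounds):
        m = rng.integers(d // 2, d)                  # random subset size
        feats = rng.choice(d, size=m, replace=False) # random feature subset
        scores += knn_outlier_scores(X[:, feats], k=k)
    return scores / n_rounds                         # higher score = more outlying

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    X[:3] += 6.0                                     # planted outliers
    print(np.argsort(feature_bagging_scores(X, rng=1))[-3:])
```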



Book ChapterDOI
01 Jan 2005
TL;DR: This chapter provides an overview of the state of the art in intrusion detection research and gives a taxonomy of computer intrusions, along with brief descriptions of the major computer attack categories.
Abstract: This chapter provides an overview of the state of the art in intrusion detection research. Intrusion detection systems are software and/or hardware components that monitor computer systems and analyze events occurring in them for signs of intrusions. Due to the widespread diversity and complexity of computer infrastructures, it is difficult to provide a completely secure computer system. Therefore, there are numerous security systems and intrusion detection systems that address different aspects of computer security. This chapter first provides a taxonomy of computer intrusions, along with brief descriptions of major computer attack categories. Second, a common architecture of intrusion detection systems and their basic characteristics are presented. Third, a taxonomy of intrusion detection systems based on five criteria (information source, analysis strategy, time aspects, architecture, response) is given. Finally, intrusion detection systems are classified according to each of these categories and the most representative research prototypes are briefly described.

215 citations
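
The five-criterion taxonomy mentioned in the abstract can be pictured as a simple lookup structure. The criteria below come from the abstract; the example values listed under each criterion are common textbook categories assumed here for illustration, not the chapter's exact subdivisions.

```python
# Sketch of the five-criterion IDS taxonomy as a plain data structure.
# The criteria come from the abstract; the example values under each
# criterion are common categories assumed here for illustration.
IDS_TAXONOMY = {
    "information source": ["host-based", "network-based", "application logs"],
    "analysis strategy": ["misuse (signature) detection", "anomaly detection"],
    "time aspects": ["real-time", "offline / batch"],
    "architecture": ["centralized", "distributed"],
    "response": ["passive (alerting)", "active (automated countermeasures)"],
}

def classify(system: dict) -> dict:
    """Keep only the taxonomy criteria a given system description fills in."""
    return {c: system[c] for c in IDS_TAXONOMY if c in system}

print(classify({"information source": "network-based",
                "analysis strategy": "anomaly detection"}))
```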


Journal ArticleDOI
TL;DR: This paper proposes an LDA-based incremental dimension reduction algorithm, called IDR/QR, which applies QR decomposition rather than SVD and therefore does not require the whole data matrix in main memory, a property that is desirable for large data sets.
Abstract: Dimension reduction is a critical data preprocessing step for many database and data mining applications, such as efficient storage and retrieval of high-dimensional data. In the literature, a well-known dimension reduction algorithm is linear discriminant analysis (LDA). The common aspect of previously proposed LDA-based algorithms is the use of singular value decomposition (SVD). Due to the difficulty of designing an incremental solution for the eigenvalue problem on the product of scatter matrices in LDA, there has been little work on designing incremental LDA algorithms that can efficiently incorporate new data items as they become available. In this paper, we propose an LDA-based incremental dimension reduction algorithm, called IDR/QR, which applies QR decomposition rather than SVD. Unlike other LDA-based algorithms, this algorithm does not require the whole data matrix in main memory. This is desirable for large data sets. More importantly, with the insertion of new data items, the IDR/QR algorithm can constrain the computational cost by applying efficient QR-updating techniques. Finally, we evaluate the effectiveness of the IDR/QR algorithm in terms of classification error rate on the reduced dimensional space. Our experiments on several real-world data sets reveal that the classification error rate achieved by the IDR/QR algorithm is very close to the best possible one achieved by other LDA-based algorithms. However, the IDR/QR algorithm has much less computational cost, especially when new data items are inserted dynamically.

127 citations
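
A minimal sketch of the QR-based idea behind IDR/QR: project the data onto an orthogonal basis of the class-centroid matrix obtained from an economy QR decomposition. This shows only a simplified batch first stage; the paper's incremental QR-updating for newly inserted items is not reproduced.

```python
# Minimal sketch of the QR-based idea behind IDR/QR: project the data onto
# the orthogonal basis of the class-centroid matrix obtained from an economy
# QR decomposition. Only a simplified batch first stage is shown; the
# paper's incremental QR-updating for newly inserted items is omitted.
import numpy as np

def centroid_qr_projection(X, y):
    """X: (n_samples, n_features); y: integer class labels."""
    classes = np.unique(y)
    # centroid matrix: one column per class (n_features x n_classes)
    C = np.stack([X[y == c].mean(axis=0) for c in classes], axis=1)
    Q, _ = np.linalg.qr(C)           # economy QR, Q: (n_features, n_classes)
    return X @ Q                     # reduced representation, n_classes dims

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=m, size=(50, 20)) for m in (0.0, 2.0, 4.0)])
    y = np.repeat([0, 1, 2], 50)
    print(centroid_qr_projection(X, y).shape)   # (150, 3)
```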


Journal ArticleDOI
27 Nov 2005
TL;DR: This paper formulates the problem of summarizing a data set of transactions with categorical attributes as an optimization problem involving two objective functions, compaction gain and information loss, and proposes metrics to characterize the output of any summarization algorithm.
Abstract: In this paper, we formulate the problem of summarization of a dataset of transactions with categorical attributes as an optimization problem involving two objective functions - compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent item sets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Off-line Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.

117 citations
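
One way to picture the two competing objectives is sketched below, under assumed definitions: compaction gain as the ratio of the number of transactions to the number of summary rows, and information loss as the fraction of attribute values left unspecified by the covering summary row. The paper defines its own metrics, which may differ from these.

```python
# Illustrative sketch of the two competing objectives, under assumed
# definitions: compaction gain = |transactions| / |summary|, and
# information loss = fraction of a transaction's attribute values that its
# covering summary row leaves unspecified (wildcard '*'). The paper's exact
# metric definitions may differ.
def compaction_gain(transactions, summary):
    return len(transactions) / len(summary)

def information_loss(transactions, summary, cover):
    """cover[i] = index of the summary row that represents transaction i."""
    lost, total = 0, 0
    for i, t in enumerate(transactions):
        s = summary[cover[i]]
        total += len(t)
        lost += sum(1 for attr in t if s.get(attr, "*") == "*")
    return lost / total

transactions = [{"proto": "tcp", "port": 80, "flag": "SYN"},
                {"proto": "tcp", "port": 443, "flag": "SYN"},
                {"proto": "udp", "port": 53, "flag": "-"}]
summary = [{"proto": "tcp", "port": "*", "flag": "SYN"},   # covers rows 0 and 1
           {"proto": "udp", "port": 53, "flag": "-"}]
cover = [0, 0, 1]
print(compaction_gain(transactions, summary),
      information_loss(transactions, summary, cover))
```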


Proceedings ArticleDOI
10 May 2005
TL;DR: This paper exploits the underlying principle of the first-order Markov model on which PageRank is based to incrementally compute PageRank for the evolving Web graph, and shows a significant speed-up in computational cost.
Abstract: Link analysis has been a popular and widely used Web mining technique, especially in the area of Web search. Various ranking schemes based on link analysis have been proposed, of which the PageRank metric has gained the most popularity with the success of Google. Over the last few years, there has been significant work on improving the relevance model of PageRank to address issues such as personalization and topic relevance. In addition, a variety of ideas have been proposed to address the computational aspects of PageRank, both in terms of efficient I/O computations and the matrix computations involved in computing the PageRank score. The key challenge has been to perform computation on very large Web graphs. In this paper, we propose a method to incrementally compute PageRank for a large graph that is evolving. We note that although the Web graph evolves over time, its rate of change is rather slow when compared to its size. We exploit the underlying principle of the first-order Markov model on which PageRank is based to incrementally compute PageRank for the evolving Web graph. Our experimental results show a significant speed-up in computational cost, since the computation involves only the (small) portion of the Web graph that has undergone change. Our approach is quite general and can be used to incrementally compute (on evolving graphs) any metric that satisfies the first-order Markov property.

101 citations
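
A minimal sketch of the warm-start aspect of incremental PageRank computation: when the graph changes slightly, restarting the power iteration from the previous score vector converges in far fewer iterations than starting from scratch. The paper's method additionally restricts computation to the changed portion of the graph, which this sketch does not attempt.

```python
# Sketch of PageRank by power iteration with a warm start: after a small
# change to the graph, iterating from the previous score vector converges
# in fewer steps than starting from the uniform vector. The paper goes
# further and confines work to the changed part of the graph; that part is
# not reproduced here.
import numpy as np

def pagerank(adj, d=0.85, tol=1e-10, x0=None):
    """adj: dict node -> list of out-neighbours (nodes are 0..n-1)."""
    n = len(adj)
    x = np.full(n, 1.0 / n) if x0 is None else x0.copy()
    for it in range(1000):
        nxt = np.full(n, (1.0 - d) / n)
        for u, outs in adj.items():
            if outs:
                share = d * x[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                nxt += d * x[u] / n          # dangling node: spread uniformly
        if np.abs(nxt - x).sum() < tol:
            return nxt, it + 1
        x = nxt
    return x, 1000

adj = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
pr, iters_cold = pagerank(adj)
adj[3] = [0, 2]                               # a small change to the graph
pr_new, iters_warm = pagerank(adj, x0=pr)     # warm start from the old scores
print(iters_cold, iters_warm)
```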


Journal ArticleDOI
TL;DR: The Minnesota Intrusion Detection System can detect sophisticated cyberattacks on large-scale networks that are hard to detect using signature-based systems.
Abstract: Parallel and distributed data mining offer great promise for addressing cybersecurity. The Minnesota Intrusion Detection System can detect sophisticated cyberattacks on large-scale networks that are hard to detect using signature-based systems.

82 citations


Journal ArticleDOI
TL;DR: In this article, a 19-year record of global satellite observations of vegetation phenology from the advanced very high resolution radiometer (AVHRR) was used to characterize major ecosystem disturbance events and regimes.
Abstract: Ecosystem structure and function are strongly affected by disturbance events, many of which in North America are associated with seasonal temperature extremes, wildfires, and tropical storms. This study was conducted to evaluate patterns in a 19-year record of global satellite observations of vegetation phenology from the advanced very high resolution radiometer (AVHRR) as a means to characterize major ecosystem disturbance events and regimes. The fraction of photosynthetically active radiation absorbed by vegetation canopies (FPAR) worldwide has been computed at a monthly time interval from 1982 to 2000 and gridded at a spatial resolution of 8 km globally. Potential disturbance events were identified in the FPAR time series by locating anomalously low values (FPAR-LO) that lasted longer than 12 consecutive months at any 8-km pixel. We find verifiable evidence of numerous disturbance types across North America, including major regional patterns of cold and heat waves, forest fires, tropical storms, and large-scale forest logging. Summed over 19 years, areas potentially influenced by major ecosystem disturbances (one FPAR-LO event over the period 1982–2000) total more than 766,000 km2. The periods of highest detection frequency were 1987–1989, 1995–1997, and 1999. Sub-continental regions of the Pacific Northwest, Alaska, and Central Canada had the highest proportion (>90%) of FPAR-LO pixels detected in forests, tundra shrublands, and wetland areas. The Great Lakes region showed the highest proportion (39%) of FPAR-LO pixels detected in cropland areas, whereas the western United States showed the highest proportion (16%) of FPAR-LO pixels detected in grassland areas. Based on this analysis, an historical picture is emerging of periodic droughts and heat waves, possibly coupled with herbivorous insect outbreaks, as among the most important causes of ecosystem disturbance in North America.

53 citations
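
The FPAR-LO rule described above (anomalously low FPAR lasting longer than 12 consecutive months) can be sketched for a single pixel's monthly series. The anomaly threshold used here (1.5 standard deviations below the pixel's own mean) is an assumption, since the abstract does not state how "anomalously low" was defined.

```python
# Sketch of the FPAR-LO rule for one pixel: flag an event when monthly FPAR
# stays anomalously low for more than 12 consecutive months. The threshold
# (mean - 1.5 * std of the pixel's own series) is an assumed definition of
# "anomalously low"; the study's actual anomaly criterion may differ.
import numpy as np

def fpar_lo_events(fpar, min_run=12, n_std=1.5):
    """fpar: 1-D monthly series for one 8-km pixel. Returns (start, end) runs."""
    low = fpar < (fpar.mean() - n_std * fpar.std())
    events, start = [], None
    for i, flag in enumerate(low):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > min_run:
                events.append((start, i - 1))
            start = None
    if start is not None and len(low) - start > min_run:
        events.append((start, len(low) - 1))
    return events

months = np.arange(228)                      # 19 years x 12 months (1982-2000)
fpar = 0.6 + 0.1 * np.sin(2 * np.pi * months / 12)
fpar[100:120] -= 0.35                        # a 20-month disturbance
print(fpar_lo_events(fpar))                  # [(100, 119)]
```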



Proceedings ArticleDOI
06 Jun 2005
TL;DR: In this paper, a vector of weights is assigned to each vertex, and the goal is to produce a k-way partitioning such that the partitioning satisfies a balancing constraint associated with each weight, while attempting to minimize the edge-cut.
Abstract: Traditional graph partitioning algorithms compute a k-way partitioning of a graph such that the number of edges that are cut by the partitioning is minimized and each partition has an equal number of vertices. The task of minimizing the edge-cut can be considered the objective and the requirement that the partitions be of the same size can be considered the constraint. In this paper we extend the partitioning problem by incorporating an arbitrary number of balancing constraints. In our formulation, a vector of weights is assigned to each vertex, and the goal is to produce a k-way partitioning such that the partitioning satisfies a balancing constraint associated with each weight, while attempting to minimize the edge-cut. Applications of this multi-constraint graph partitioning problem include the parallel solution of multi-physics and multi-phase computations that underlie many existing and emerging large-scale scientific simulations. We present new multi-constraint graph partitioning algorithms that are based on the multilevel graph partitioning paradigm. Our work focuses on developing new types of heuristics for coarsening, initial partitioning, and refinement that are capable of successfully handling multiple constraints. We experimentally evaluate the effectiveness of our multi-constraint partitioners on a variety of synthetically generated problems.

34 citations
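
The quantities in the multi-constraint formulation can be illustrated with a small checker that computes the edge-cut of a k-way partitioning and verifies a balance constraint separately for every component of the vertex weight vectors. The 5% imbalance tolerance is an illustrative assumption, not a value from the paper.

```python
# Sketch of the quantities in the multi-constraint formulation: the edge-cut
# of a k-way partitioning and a balance check applied separately to every
# component of the vertex weight vectors. The 5% imbalance tolerance is an
# illustrative assumption, not a value from the paper.
import numpy as np

def edge_cut(edges, part):
    return sum(1 for u, v in edges if part[u] != part[v])

def is_balanced(weights, part, k, tol=1.05):
    """weights: (n_vertices, n_constraints); part[i] in 0..k-1."""
    totals = weights.sum(axis=0)                         # per-constraint totals
    for p in range(k):
        load = weights[part == p].sum(axis=0)
        if np.any(load > tol * totals / k):              # check every constraint
            return False
    return True

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
weights = np.array([[1, 2], [1, 1], [1, 1], [1, 2]])     # two constraints per vertex
part = np.array([0, 0, 1, 1])
print(edge_cut(edges, part), is_balanced(weights, part, k=2))
```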


Journal ArticleDOI
01 Dec 2005
TL;DR: In this paper, the authors analyzed 17 yr (1982-1998) of net carbon flux predictions from a simulation model based on satellite observations of monthly vegetation cover and found that although the terrestrial ecosystem sink for atmospheric CO2 for the Eurasian region has been fairly consistent at between 0.3 and 0.6 Pg C per year since 1988, high interannual variability in net ecosystem production (NEP) fluxes can be readily identified at locations across the continent.
Abstract: We have analyzed 17 yr (1982–1998) of net carbon flux predictions from a simulation model based on satellite observations of monthly vegetation cover. The NASA-CASA model was driven by vegetation cover properties derived from the Advanced Very High Resolution Radiometer and radiative transfer algorithms that were developed for the Moderate Resolution Imaging Spectroradiometer (MODIS). We report that although the terrestrial ecosystem sink for atmospheric CO2 for the Eurasian region has been fairly consistent at between 0.3 and 0.6 Pg C per year since 1988, high interannual variability in net ecosystem production (NEP) fluxes can be readily identified at locations across the continent. Ten major areas of highest variability in NEP were detected: eastern Europe, the Iberian Peninsula, the Balkan states, Scandinavia, northern and western Russia, eastern Siberia, Mongolia and western China, and central India. Analysis of climate anomalies over this 17-yr time period suggests that variability in precipitation and surface solar irradiance could be associated with trends in carbon sink fluxes within such regions of high NEP variability.

Proceedings ArticleDOI
27 Nov 2005
TL;DR: This paper describes an approach to defining confidence for error-tolerant itemsets that preserves the interpretation of confidence as a conditional probability, and derives a confidence measure for continuous data that agrees with the standard confidence measure when applied to binary transaction data.
Abstract: In this paper, we explore extending association analysis to non-traditional types of patterns and nonbinary data by generalizing the notion of confidence. The key idea is to regard confidence as a measure of the extent to which the strength of one association pattern provides information about the strength of another. This approach provides a framework that encompasses the traditional concept of confidence as a special case and can be used as the basis for designing a variety of new confidence measures. Besides discussing such confidence measures, we provide examples that illustrate the potential usefulness of a generalized notion of confidence. In particular, we describe an approach to defining confidence for error tolerant itemsets that preserves the interpretation of confidence as a conditional probability and derive a confidence measure for continuous data that agrees with the standard confidence measure when applied to binary transaction data.
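
For reference, the sketch below computes the standard confidence of a rule from binary transaction data, together with one assumed continuous generalization (a ratio of per-transaction strengths) that reduces to the standard measure on 0/1 data. It is not claimed to be the measure derived in the paper.

```python
# Sketch of confidence as a conditional-probability-style ratio. For binary
# transaction data this is the standard support(A and B) / support(A); the
# continuous variant shown (a ratio of summed per-transaction strengths) is
# one assumed generalization for illustration, not the measure derived in
# the paper. It reduces to the standard confidence on 0/1 data.
import numpy as np

def confidence_binary(X, A, B):
    """X: 0/1 matrix (transactions x items); A, B: lists of column indices."""
    has_A = X[:, A].all(axis=1)
    has_AB = has_A & X[:, B].all(axis=1)
    return has_AB.sum() / has_A.sum()

def confidence_continuous(X, A, B):
    """Replace counts by products of (non-negative) attribute values."""
    strength_A = X[:, A].prod(axis=1)
    strength_AB = strength_A * X[:, B].prod(axis=1)
    return strength_AB.sum() / strength_A.sum()

X = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 0, 1]], dtype=float)
print(confidence_binary(X.astype(int), A=[0], B=[1]))   # 2/3
print(confidence_continuous(X, A=[0], B=[1]))           # identical on 0/1 data
```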

Proceedings ArticleDOI
31 Oct 2005
TL;DR: This paper presents a data mining framework using semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases, and introduces a new approach to semi-supervised learning, hyperclique pattern based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects.
Abstract: In multi-relational databases, a view, which is a context- and content-dependent subset of one or more tables (or other views), is often used to preserve privacy by hiding sensitive information. However, recent developments in data mining present a new challenge for database security even when traditional database security techniques, such as database access control, are employed. This paper presents a data mining framework using semi-supervised learning that demonstrates the potential for privacy leakage in multi-relational databases. Many different types of semi-supervised learning techniques, such as the K-nearest neighbor (KNN) method, can be used to demonstrate privacy leakage. However, we also introduce a new approach to semi-supervised learning, hyperclique pattern based semi-supervised learning (HPSL), which differs from traditional semi-supervised learning approaches in that it considers the similarity among groups of objects instead of only pairs of objects. Our experimental results show that both the KNN and HPSL methods have the ability to compromise database security, although HPSL is better at this privacy violation than the KNN method.
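
A minimal sketch of the k-nearest-neighbour side of the framework: inferring a hidden sensitive label for view records from a handful of labelled records, using distances over the non-sensitive attributes the view still exposes. The toy data and Euclidean distance are assumptions; the hyperclique-based HPSL method is not reproduced here.

```python
# Sketch of the KNN side of the framework: infer a hidden sensitive label
# for records in a "privacy-preserving" view from a few labelled records,
# using distance over the non-sensitive attributes the view still exposes.
# The toy data and Euclidean distance are assumed for illustration; the
# hyperclique-based HPSL method is not reproduced here.
import numpy as np

def knn_predict(X_labeled, y_labeled, X_hidden, k=3):
    preds = []
    for x in X_hidden:
        d = np.linalg.norm(X_labeled - x, axis=1)
        nearest = np.argsort(d)[:k]
        preds.append(np.bincount(y_labeled[nearest]).argmax())  # majority vote
    return np.array(preds)

rng = np.random.default_rng(0)
X_labeled = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(3, 1, (20, 4))])
y_labeled = np.repeat([0, 1], 20)                        # known sensitive labels
X_hidden = np.vstack([rng.normal(0, 1, (5, 4)), rng.normal(3, 1, (5, 4))])
print(knn_predict(X_labeled, y_labeled, X_hidden))       # recovers the hidden labels
```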

Journal ArticleDOI
TL;DR: Key advances in remote sensing science are summarized in this article, which is particularly focused on information that could not be retrieved without this technology.
Abstract: This paper aims to assess the contribution of remote sensing technology in addressing key questions raised by the Large Scale Biosphere-Atmosphere Experiment in Amazonia (LBA). The answers to these questions foster knowledge of the climatic, biogeochemical and hydrologic functioning of the Amazon, as well as of the impact of human activities at regional and global scales. Remote sensing methods allow integrating information on several processes at different temporal and spatial scales. By doing so, it is possible to perceive hidden relations among processes and structures, enhancing their teleconnections. Key advances in remote sensing science are summarized in this article, which is particularly focused on information that could not be retrieved without this technology.

Proceedings Article
01 Jan 2005
TL;DR: There is extensive overlap between stages when 5 specific markers are used, and there is a mixing of classes, especially between stage 1 and stage 2, in the FibroTest.
Abstract: Confusion matrix of fibrosis stage classification (rows: labels f4–f0; columns: f4–f0):

        f4   f3   f2   f1   f0
  f4    19    5   10   30    0
  f3    12    7    8   36    0
  f2     4    6   20   70    0
  f1     4    3   15  165    0
  f0     0    0    0   12    0

For such a medical problem it would be ideal to get the predictions as close to the diagonal as possible. There is a mixing of classes (marked in red in the original figure), especially between stage 1 and stage 2, as in the FibroTest, where there is extensive overlap between stages when 5 specific markers are used; 3 of the 5 markers are not commonly used tests.

Journal ArticleDOI
TL;DR: In this article, the authors used the Advanced Very High Resolution Radiometer and radiative transfer algorithms that were developed for the Moderate Resolution Imaging Spectroradiometer (MODIS) to predict the terrestrial ecosystem flux for atmospheric CO2 for the Amazon region of South America.
Abstract: Seventeen years (1982–98) of net carbon flux predictions for Southern Hemisphere continents have been analyzed, based on a simulation model using satellite observations of monthly vegetation cover. The NASA Carnegie Ames Stanford Approach (CASA) model was driven by vegetation-cover properties derived from the Advanced Very High Resolution Radiometer and radiative transfer algorithms that were developed for the Moderate Resolution Imaging Spectroradiometer (MODIS). The terrestrial ecosystem flux for atmospheric CO2 for the Amazon region of South America has been predicted between a biosphere source of –0.17 Pg C per year (in 1983) and a biosphere sink of +0.64 Pg C per year (in 1989). The areas of highest variability in net ecosystem production (NEP) fluxes across all of South America were detected in the south-central rain forest areas of the Amazon basin and in southeastern Brazil. Similar levels of variability were recorded across central forested portions of Africa and in the southern horn of ...

01 Jan 2005
TL;DR: Experiments suggest that one of the proposed methods, UPR, is promising and has a number of desirable properties: it generalizes PageRank and inherits basic PageRank properties, and it is also stable and flexible.
Abstract: This thesis explores the possibility of incorporating usage statistics to improve ranking quality in site-specific and intranet search engines. A number of usage-based ranking approaches are introduced, including a PageRank extension, Usage-aware PageRank (UPR), an extension to HITS (UNITS), and a naive approach that uses the number of visits to pages as a quality measure. These methods are compared against each other and against two major link analysis approaches: PageRank and HITS. Weighting schemes that take into account the probability of visiting a page directly (by typing or via bookmarks), as well as the relative probability of following a particular link from a given page, are explored. Both of these probabilities can be approximated from usage logs. Experiments are carried out using a site-specific search engine incorporating the above methods, with 6+ months of usage logs centered around the snapshot. The parameter space for UPR and UNITS is sampled to examine the effects of varying usage emphasis factors. Experiments suggest that one of the proposed methods, UPR, is promising and has a number of desirable properties, generalizing PageRank and inheriting basic PageRank properties. It is also stable and flexible. Usage-based signals such as UPR can be especially useful in an intranet/site-specific search setting, where documents tend to be poorly connected compared to the Web, but inherently there is no or very little incentive for spamming.
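
One plausible reading of the usage-based weighting schemes described above is sketched below: the teleport distribution is estimated from direct visits (typed URLs, bookmarks) and transition probabilities are weighted by observed link-follow counts. This is an assumed illustration, not the thesis's exact UPR formulation.

```python
# Sketch of a usage-weighted PageRank in the spirit of UPR: the teleport
# distribution comes from how often pages are entered directly (bookmarks,
# typed URLs) and transition probabilities are weighted by how often each
# link is actually followed, both estimated from usage logs. This is an
# assumed reading of the weighting schemes, not the exact UPR definition.
import numpy as np

def usage_pagerank(link_counts, direct_visits, d=0.85, iters=100):
    """link_counts[i, j]: observed follows of the link i -> j;
       direct_visits[i]: observed direct entries to page i."""
    n = len(direct_visits)
    teleport = direct_visits / direct_visits.sum()
    row_sums = link_counts.sum(axis=1, keepdims=True)
    P = np.divide(link_counts, row_sums, out=np.tile(teleport, (n, 1)),
                  where=row_sums > 0)                     # dangling rows teleport
    x = teleport.copy()
    for _ in range(iters):
        x = (1 - d) * teleport + d * (x @ P)
    return x

link_counts = np.array([[0, 30, 10], [5, 0, 25], [0, 0, 0]], dtype=float)
direct_visits = np.array([50.0, 10.0, 5.0])
print(usage_pagerank(link_counts, direct_visits))
```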


Journal Article
TL;DR: In this article, a laboratory experiment was conducted to find out the vase-life of cut blooms of chrysanthemum; for flowers from plants that received 20 g N pot−1, potash application of 16 pot−1 was equally effective in improving vase-life.
Abstract: A laboratory experiment was conducted to find out the vase-life of cut blooms of chrysanthemum. Maximum vase-life (12.15–12.50 days) was recorded with 0.5% sucrose, followed by water (control) and sodium benzoate, when flowers were taken from potted plants that received 20 g N pot−1. Potash application of 16 pot−1 was equally effective in improving the vase-life of flowers dipped in 0.5% sucrose solution. However, dipping of flowers in sodium benzoate (0.5%) failed to enhance vase-life significantly. Pinching after 20 days of repotting was significantly effective in increasing the vase-life of flowers treated with water, followed by sodium benzoate (0.5%); however, there was no significant effect of dipping in sucrose solution (0.5%) on vase-life during either year.


Proceedings ArticleDOI
06 Jun 2005
TL;DR: Locally-Matched Multilevel Scratch-Remap (LMSR) and Wavefront Diffusion are presented in this paper as two new schemes for adaptive repartitioning.
Abstract: One ingredient which is viewed as vital to the successful conduct of many large-scale numerical simulations is the ability to dynamically repartition the underlying adaptive finite element mesh among the processors so that the computations are balanced and interprocessor communication is minimized. This requires that a sequence of partitions of the computational mesh be computed during the course of the computation in which the amount of data migration necessary to realize subsequent partitions is minimized, while all of the domains of a given partition contain a roughly equal amount of computational weight. Recently, parallel multilevel graph repartitioning techniques have been developed that can quickly compute high-quality repartitions for adaptive and dynamic meshes while minimizing the amount of data which needs to be migrated between processors. These algorithms can be categorized as either schemes which compute a new partition from scratch and then intelligently remap this partition to the original partition (hereafter referred to as scratch-remap schemes), or multilevel diffusion schemes. Scratch-remap schemes work quite well for graphs which are highly imbalanced in localized areas. On slightly to moderately imbalanced graphs and those in which imbalance occurs globally throughout the graph, however, they result in excessive vertex migration compared to multilevel diffusion algorithms. On the other hand, diffusion-based schemes work well for slightly imbalanced graphs and for those in which imbalance occurs globally throughout the graph. However, these schemes perform poorly on graphs that are highly imbalanced in localized areas, as the propagation of diffusion over long distances results in excessive edge-cut and vertex migration. In this paper, we present two new schemes for adaptive repartitioning: Locally-Matched Multilevel Scratch-Remap (or LMSR) and Wavefront Diffusion. The LMSR scheme performs purely local coarsening and partition remapping in a multilevel context. In Wavefront Diffusion, the flow of vertices moves in a wavefront from overbalanced to underbalanced domains. We present experimental evaluations of our LMSR and Wavefront Diffusion algorithms on synthetically generated adaptive meshes as well as on some application meshes. We show that our LMSR algorithm decreases the amount of vertex migration required to balance the graph and produces repartitionings of similar quality compared to state-of-the-art scratch-remap schemes. Furthermore, we show that our LMSR algorithm is more scalable in terms of execution time compared to state-of-the-art scratch-remap schemes. We show that our Wavefront Diffusion algorithm obtains significantly lower vertex migration requirements, while maintaining similar edge-cut results compared to state-of-the-art multilevel diffusion algorithms, especially for highly imbalanced graphs. Furthermore, we compare Wavefront Diffusion with LMSR and show that the former results in lower vertex migration requirements and the latter results in higher-quality edge-cut results. These results hold true regardless of the distance over which diffusion is required to propagate in order to balance the graph. Finally, we discuss the run times of our schemes, which are both capable of repartitioning an eight-million-node graph in under three seconds on a 128-processor Cray T3E.
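
The diffusion side of these schemes can be illustrated with a generic first-order diffusive load-balancing step on the domain graph, in which each domain exchanges load with its neighbours in proportion to the load difference. This is a generic sketch of the diffusion idea only; it is neither Wavefront Diffusion nor LMSR.

```python
# Generic sketch of one diffusive load-balancing step on the domain graph:
# each domain exchanges load with its neighbouring domains in proportion to
# the load difference. This illustrates only the diffusion idea behind such
# repartitioners; it is not the Wavefront Diffusion or LMSR algorithm.
import numpy as np

def diffusion_step(load, neighbours, alpha=0.25):
    """load[p]: computational weight of domain p; neighbours: adjacency lists."""
    new = load.astype(float)
    for p, nbrs in neighbours.items():
        for q in nbrs:
            if q > p:                                   # handle each edge once
                flow = alpha * (load[p] - load[q])      # positive: p sends to q
                new[p] -= flow
                new[q] += flow
    return new

load = np.array([40.0, 10.0, 10.0, 20.0])               # imbalanced domains
neighbours = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
for _ in range(10):
    load = diffusion_step(load, neighbours)
print(np.round(load, 2))                                 # approaches the mean (20)
```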