scispace - formally typeset
Search or ask a question

Showing papers by "Saptarsi Goswami published in 2011"


Proceedings ArticleDOI
08 Apr 2011
TL;DR: Outlier detection technique is utilized to find the processes varying most from the group in terms of execution trace, and among the 5 outlier cluster, 2 of them are genuine concern, while the other 3 figure out because of the huge number of rows involved.
Abstract: Extract, Transform, Load (ETL) is an integral part of Data Warehousing (DW) implementation. The commercial tools that are used for this purpose captures lot of execution trace in form of various log files with plethora of information. However there has been hardly any initiative where any proactive analyses have been done on the ETL logs to improve their efficiency. In this paper we utilize outlier detection technique to find the processes varying most from the group in terms of execution trace. As our experiment was carried on actual production processes, any outlier we would consider as a signal rather than a noise. To identify the input parameters for the outlier detection algorithm we employ a survey among developer community with varied mix of experience & expertise. We use simple text parsing to extract these features from the logs, as shortlisted from the survey. Subsequently we applied outlier detection technique (Clustering based) on the logs. By this process we reduced our domain of detailed analysis from 500 logs to 44 logs (8 Percentage). Among the 5 outlier cluster, 2 of them are genuine concern, while the other 3 figure out because of the huge number of rows involved.

7 citations


Journal ArticleDOI
TL;DR: In this paper, the authors explore four outlier detection techniques over a rich corpora of production queries and analyze the results, and also explore the benefit of an ensemble approach for optimization of extraction, transform, load (ETL).
Abstract: S is the heart for both OLTP and OLAP types of applications. For both types of applications thousands of queries expressed in terms of SQL are executed on daily basis. All the commercial DBM S engines capture various attributes in system tables about these executed queries. These queries need to conform to best practices and need to be tuned to ensure optimal performance. While we use checklists, often tools to enforce the same, a black box technique on the queries for profiling, outlier detection is not employed for a summary level understanding. This is the motivation of the paper, as this not only points out to inefficiencies built in the system, but also has the potential to point evolving best practices and inappropriate usage. Certainly this can reduce latency in information flow and optimal utilization of hardware and software capacity. In this paper we start with formulating the problem. We explore four outlier detection techniques. We apply these techniques over rich corpora of production queries and analyze the results. We also explore benefit of an ensemble approach. We conclude with future courses of action. The same philosophy we have used for optimization of extraction, transform, load (ETL) jobs in one of our previous work. We give a brief introduction of the same in section four.

5 citations