
Showing papers in "ACM Computing Surveys in 2009"


Journal ArticleDOI
TL;DR: This survey tries to provide a structured and comprehensive overview of the research on anomaly detection by grouping existing techniques into different categories based on the underlying approach adopted by each technique.
Abstract: Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.
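
As a concrete illustration of the kind of basic technique the survey builds its categories around, here is a minimal statistical detector in Python (not taken from the survey; the readings and threshold are invented). It flags points far from the median in robust (MAD) units, so the anomaly itself cannot distort the notion of normality:

```python
import statistics

def mad_anomalies(values, threshold=3.5):
    """Flag points whose modified z-score exceeds `threshold`.

    Key assumption (typical of statistical techniques): normal points
    cluster near the center of the distribution, anomalies in its tails.
    Median/MAD is used instead of mean/stdev so the anomaly itself
    cannot inflate the estimate of normality.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    if mad == 0:
        return []  # all points essentially identical: nothing to flag
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0, 10.2]
print(mad_anomalies(readings))  # [42.0]
```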

9,627 citations


Journal ArticleDOI
TL;DR: This work introduces the reader to the motivations for solving the ambiguity of words and provides a description of the task, and overviews supervised, unsupervised, and knowledge-based approaches.
Abstract: Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is considered an AI-complete problem, that is, a task whose solution is at least as hard as the most difficult problems in artificial intelligence. We introduce the reader to the motivations for solving the ambiguity of words and provide a description of the task. We overview supervised, unsupervised, and knowledge-based approaches. The assessment of WSD systems is discussed in the context of the Senseval/Semeval campaigns, aiming at the objective evaluation of systems participating in several different disambiguation tasks. Finally, applications, open problems, and future directions are discussed.
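
To make the knowledge-based family concrete, here is a toy simplified-Lesk disambiguator in Python: pick the sense whose dictionary gloss shares the most words with the context. The glosses and example are invented; real systems use resources such as WordNet:

```python
def simplified_lesk(context_words, sense_glosses):
    """Pick the sense whose gloss overlaps the context the most.

    A toy version of the knowledge-based (Lesk-style) approach the
    survey covers; the glosses below are invented for illustration.
    """
    context = set(context_words)
    def overlap(gloss):
        return len(context & set(gloss.split()))
    return max(sense_glosses, key=lambda s: overlap(sense_glosses[s]))

glosses = {
    "bank/finance": "institution that accepts deposits and lends money",
    "bank/river":   "sloping land beside a body of water",
}
context = "he sat on the sloping land near the water".split()
print(simplified_lesk(context, glosses))  # bank/river
```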

2,178 citations


Journal ArticleDOI
TL;DR: This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values.
Abstract: The development of the Internet in recent years has made it possible and useful to access many different information systems anywhere in the world to obtain information. While there is much research on the integration of heterogeneous information systems, most commercial systems stop short of the actual integration of available data. Data fusion is the process of fusing multiple records representing the same real-world object into a single, consistent, and clean representation. This article places data fusion into the greater context of data integration, precisely defines the goals of data fusion, namely, complete, concise, and consistent data, and highlights the challenges of data fusion, namely, uncertain and conflicting data values. We give an overview and classification of different ways of fusing data and present several techniques based on standard and advanced operators of the relational algebra and SQL. Finally, the article features a comprehensive survey of data integration systems from academia and industry, showing if and how data fusion is performed in each.
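
A minimal sketch of one fusion step in Python, with invented field names: per attribute, nulls are dropped (the "uncertain data" challenge) and remaining conflicts are resolved by recency, one simple instance of the resolution strategies the article classifies:

```python
def fuse(records, recency_field="updated"):
    """Fuse records describing the same real-world object into one row.

    Per attribute: drop nulls (uncertain data), then resolve remaining
    conflicts by keeping the value from the most recent record.
    """
    fused = {}
    ordered = sorted(records, key=lambda r: r[recency_field])
    for rec in ordered:  # later (more recent) records overwrite earlier ones
        for key, value in rec.items():
            if value is not None:
                fused[key] = value
    return fused

dupes = [
    {"name": "J. Smith",   "phone": None,       "city": "Boston", "updated": 2007},
    {"name": "John Smith", "phone": "555-0100", "city": None,     "updated": 2009},
]
print(fuse(dupes))
# {'name': 'John Smith', 'phone': '555-0100', 'city': 'Boston', 'updated': 2009}
```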

1,797 citations


Journal ArticleDOI
TL;DR: Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology.
Abstract: The literature provides a wide range of techniques to assess and improve the quality of data. Due to the diversity and complexity of these techniques, research has recently focused on defining methodologies that help the selection, customization, and application of data quality assessment and improvement techniques. The goal of this article is to provide a systematic and comparative description of such methodologies. Methodologies are compared along several dimensions, including the methodological phases and steps, the strategies and techniques, the data quality dimensions, the types of data, and, finally, the types of information systems addressed by each methodology. The article concludes with a summary description of each methodology.

1,048 citations


Journal ArticleDOI
TL;DR: This work contributes to understanding which design components of reputation systems are most vulnerable, what are the most appropriate defense mechanisms and how these defense mechanisms can be integrated into existing or future reputation systems to make them resilient to attacks.
Abstract: Reputation systems provide mechanisms to produce a metric encapsulating reputation for a given domain for each identity within the system. These systems seek to generate an accurate assessment in the face of various factors including but not limited to unprecedented community size and potentially adversarial environments. We focus on attacks and defense mechanisms in reputation systems. We present an analysis framework that allows for the general decomposition of existing reputation systems. We classify attacks against reputation systems by identifying which system components and design choices are the targets of attacks. We survey defense mechanisms employed by existing reputation systems. Finally, we analyze several landmark systems in the peer-to-peer domain, characterizing their individual strengths and weaknesses. Our work contributes to understanding (1) which design components of reputation systems are most vulnerable, (2) which defense mechanisms are most appropriate, and (3) how these defense mechanisms can be integrated into existing or future reputation systems to make them resilient to attacks.
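
As a sketch of one defense in the spirit of those surveyed (not any particular system's metric), the aggregation below weights each rating by the rater's own reputation, blunting ballot-stuffing by throwaway identities; all names and numbers are invented:

```python
def reputation(ratings, rater_rep):
    """Aggregate ratings of a target, weighting each by its rater's own
    reputation, so low-reputation (e.g., Sybil) identities carry little weight.

    `ratings` maps rater -> score in [0, 1]; `rater_rep` maps rater -> weight.
    """
    total = sum(rater_rep.get(r, 0.0) for r in ratings)
    if total == 0:
        return 0.5  # no credible evidence: fall back to a neutral prior
    return sum(score * rater_rep.get(r, 0.0)
               for r, score in ratings.items()) / total

# Sybil identities rate the target 1.0 but carry almost no weight.
votes = {"alice": 0.2, "bob": 0.3, "sybil1": 1.0, "sybil2": 1.0}
weights = {"alice": 0.9, "bob": 0.8, "sybil1": 0.01, "sybil2": 0.01}
print(round(reputation(votes, weights), 3))  # 0.256, not dragged toward 1.0
```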

907 citations


Journal ArticleDOI
TL;DR: The aim is to provide a succinct summary of the state-of-the-art interface schemes, to illuminate both successful and unsuccessful interface strategies, and to identify potentially fruitful areas for further work.
Abstract: There are many interface schemes that allow users to work at, and move between, focused and contextual views of a dataset. We review and categorize these schemes according to the interface mechanisms used to separate and blend views. The four approaches are overview+detail, which uses a spatial separation between focused and contextual views; zooming, which uses a temporal separation; focus+context, which minimizes the seam between views by displaying the focus within the context; and cue-based techniques, which selectively highlight or suppress items within the information space. Critical features of these categories, and empirical evidence of their success, are discussed. The aim is to provide a succinct summary of the state-of-the-art, to illuminate both successful and unsuccessful interface strategies, and to identify potentially fruitful areas for further work.
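
Focus+context techniques often build on Furnas's fisheye degree-of-interest function: an item's a priori importance minus its distance from the focus, with low-DOI items elided. A minimal Python sketch, with an invented outline and threshold:

```python
def degree_of_interest(api, distance_to_focus):
    """Furnas-style fisheye DOI: a priori importance minus distance from
    the current focus; items below a threshold are elided, displaying the
    focus within a compressed version of its context."""
    return api - distance_to_focus

# Toy outline: (node, a-priori importance = -depth, distance from focus)
nodes = [("root", 0, 2), ("chap1", -1, 1), ("sec1.1 (focus)", -2, 0),
         ("sec1.2", -2, 2), ("chap2", -1, 3), ("sec2.1", -2, 4)]
visible = [n for n, api, d in nodes if degree_of_interest(api, d) >= -3]
print(visible)  # distant, unimportant nodes drop out of the view
```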

666 citations


Journal ArticleDOI
TL;DR: The state of the art in the industrial use of formal methods is described, concentrating on their increasing use at the earlier stages of specification and design, by comparing the situation in 2009 with the most significant surveys carried out over the last 20 years.
Abstract: Formal methods use mathematical models for analysis and verification at any part of the program life-cycle. We describe the state of the art in the industrial use of formal methods, concentrating on their increasing use at the earlier stages of specification and design. We do this by reporting on a new survey of industrial use, comparing the situation in 2009 with the most significant surveys carried out over the last 20 years. We describe some of the highlights of our survey by presenting a series of industrial projects, and we draw some observations from these surveys and records of experience. Based on this, we discuss the issues surrounding the industrial adoption of formal methods. Finally, we look to the future and describe the development of a Verified Software Repository, part of the worldwide Verified Software Initiative. We introduce the initial projects being used to populate the repository, and describe the challenges they address.

564 citations


Journal ArticleDOI
TL;DR: The generalization of meta-learning concepts to algorithms focused on tasks including sorting, forecasting, constraint satisfaction, and optimization, and the extension of these ideas to bioinformatics, cryptography, and other fields are discussed.
Abstract: The algorithm selection problem [Rice 1976] seeks to answer the question: Which algorithm is likely to perform best for my problem? Recognizing the problem as a learning task in the early 1990s, the machine learning community developed the field of meta-learning, focused on learning about learning algorithm performance on classification problems. But there has been only limited generalization of these ideas beyond classification, and many related attempts have been made in other disciplines (such as AI and operations research) to tackle the algorithm selection problem in different ways, introducing different terminology, and overlooking the similarities of approaches. In this sense, there is much to be gained from a greater awareness of developments in meta-learning, and of how these ideas can be generalized to learn about the behaviors of other (nonlearning) algorithms. In this article we present a unified framework for considering the algorithm selection problem as a learning problem, and use this framework to tie together the cross-disciplinary developments in tackling the algorithm selection problem. We discuss the generalization of meta-learning concepts to algorithms focused on tasks including sorting, forecasting, constraint satisfaction, and optimization, and the extension of these ideas to bioinformatics, cryptography, and other fields.
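
Rice's framework casts selection as learning a map from problem features to the best-performing algorithm. A deliberately tiny meta-learner in Python (1-nearest-neighbor over invented sorting-task features) shows the shape of the idea:

```python
def select_algorithm(features, history):
    """Rice-style algorithm selection as a learning problem: given feature
    vectors of past instances labeled with the algorithm that performed
    best, predict a winner for a new instance via 1-nearest-neighbor."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(history, key=lambda ex: dist(ex[0], features))[1]

# (problem features, best algorithm) pairs, invented for illustration:
# features = (input size, degree of presortedness) for a sorting task.
past = [((10, 0.9), "insertion_sort"), ((10_000, 0.1), "quicksort"),
        ((9_000, 0.2), "quicksort"), ((20, 0.95), "insertion_sort")]
print(select_algorithm((15, 0.85), past))  # insertion_sort
```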

517 citations


Journal ArticleDOI
TL;DR: As work in Web page classification is reviewed, the importance of these Web-specific features and algorithms is noted, state-of-the-art practices are described, and the underlying assumptions behind the use of information from neighboring pages are tracked.
Abstract: Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
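
A toy sketch of the neighborhood signal in Python: blend an on-page text classifier's scores with the known labels of neighboring pages. The weighting knob and all numbers are invented; real systems are far more careful about which neighbors to trust:

```python
def classify_page(own_text_scores, neighbor_labels, alpha=0.6):
    """Combine an on-page text classifier with evidence from neighboring
    (linking/linked) pages, the Web-specific signal the survey tracks.

    `own_text_scores` maps class -> probability from a plain text classifier;
    `neighbor_labels` lists the known classes of neighboring pages;
    `alpha` weights on-page vs. neighborhood evidence (an invented knob).
    """
    n = len(neighbor_labels) or 1
    return max(own_text_scores,
               key=lambda c: alpha * own_text_scores[c]
                             + (1 - alpha) * neighbor_labels.count(c) / n)

scores = {"sports": 0.55, "finance": 0.45}      # text alone is ambiguous
neighbors = ["finance", "finance", "finance", "sports"]
print(classify_page(scores, neighbors))          # finance
```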

502 citations


Journal ArticleDOI
TL;DR: The issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering, and visualization, are discussed, and the role played by the quality of the cluster labels is emphasized.
Abstract: Web clustering engines organize search results by topic, thus offering a complementary view to the flat-ranked list returned by conventional search engines. In this survey, we discuss the issues that must be addressed in the development of a Web clustering engine, including acquisition and preprocessing of search results, their clustering, and visualization. Search results clustering, the core of the system, has specific requirements that cannot be addressed by classical clustering algorithms. We emphasize the role played by the quality of the cluster labels as opposed to optimizing only the clustering structure. We highlight the main characteristics of a number of existing Web clustering engines and also discuss how to evaluate their retrieval performance. Some directions for future research are finally presented.
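
A deliberately tiny, label-driven sketch in Python of what a clustering engine does with snippets; it is nothing like a production algorithm (those use, e.g., suffix-tree or phrase-based methods), but it echoes the survey's point that readable labels matter as much as cluster shape:

```python
from collections import defaultdict

STOP = {"the", "a", "of", "and", "for", "in", "to"}

def cluster_snippets(snippets):
    """Group search-result snippets under readable term labels: each
    snippet joins the cluster of its most widely shared salient term."""
    docs = [{w for w in s.lower().split() if w not in STOP} for s in snippets]
    df = defaultdict(int)              # document frequency of each term
    for words in docs:
        for w in words:
            df[w] += 1
    clusters = defaultdict(list)
    for snippet, words in zip(snippets, docs):
        label = max(sorted(words), key=lambda w: df[w])
        clusters[label].append(snippet)
    return dict(clusters)

results = ["Jaguar the big cat", "Jaguar car reviews",
           "Big cat habitats", "New car prices"]
print(cluster_snippets(results))
# {'big': ['Jaguar the big cat', 'Big cat habitats'],
#  'car': ['Jaguar car reviews', 'New car prices']}
```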

414 citations


Journal ArticleDOI
TL;DR: A survey of recent progress in software model checking, the algorithmic analysis of programs to prove properties of their executions, tracing its roots to logic and theorem proving.
Abstract: Software model checking is the algorithmic analysis of programs to prove properties of their executions. It traces its roots to logic and theorem proving, both to provide the conceptual framework in which to formalize the fundamental questions and to provide algorithmic procedures for the analysis of logical questions. The undecidability theorem [Turing 1936] ruled out the possibility of a sound and complete algorithmic solution for any sufficiently powerful programming model, and even under restrictions (such as finite state spaces), the correctness problem remained computationally intractable. However, just because a problem is hard does not mean it never appears in practice. Also, just because the general problem is undecidable does not imply that specific instances of the problem will also be hard. As the complexity of software systems grew, so did the need for some reasoning mechanism about correct behavior.
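
The core loop of explicit-state safety checking can be stated in a few lines: explore reachable states breadth-first and return a counterexample trace to the first bad state, or report the property verified. A finite-state Python sketch, with an invented toy system:

```python
from collections import deque

def check_safety(init, successors, is_bad):
    """Explicit-state safety checking by breadth-first reachability:
    explore all reachable states; report a counterexample trace to the
    first bad state, or None if the property holds on every reachable state.
    (A minimal finite-state sketch of the general approach, not a tool.)
    """
    frontier = deque([(init, [init])])
    seen = {init}
    while frontier:
        state, trace = frontier.popleft()
        if is_bad(state):
            return trace                     # counterexample found
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, trace + [nxt]))
    return None                              # safety property verified

# Toy system: a counter mod 8 that must never reach 5 -- except it can.
trace = check_safety(0, lambda s: [(s + 1) % 8, (s + 3) % 8], lambda s: s == 5)
print(trace)  # [0, 1, 2, 5]: a shortest path to the violation
```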

Journal ArticleDOI
TL;DR: The state of the art regarding ways in which the presence of a formal specification can be used to assist testing is reviewed.
Abstract: Formal methods and testing are two important approaches that assist in the development of high-quality software. While traditionally these approaches have been seen as rivals, in recent years a new consensus has developed in which they are seen as complementary. This article reviews the state of the art regarding ways in which the presence of a formal specification can be used to assist testing.
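
The basic pattern behind much specification-based testing: use the pre/postcondition pair as a test oracle over generated inputs. A minimal Python sketch with an invented integer-square-root spec and a deliberately buggy implementation for the oracle to catch:

```python
import random

def spec_based_test(impl, pre, post, gen, trials=1000):
    """Use a formal pre/postcondition pair as a test oracle: generate
    random inputs, discard those violating the precondition, and check
    the postcondition on every output."""
    for _ in range(trials):
        x = gen()
        if not pre(x):
            continue
        y = impl(x)
        assert post(x, y), f"spec violated: input {x!r} gave {y!r}"

# Toy spec for integer square root: result r satisfies r*r <= n < (r+1)**2.
def isqrt_buggy(n):
    return round(n ** 0.5)   # rounds up sometimes, violating the spec

spec_based_test(isqrt_buggy,                     # the assert above will trip
                pre=lambda n: n >= 0,
                post=lambda n, r: r * r <= n < (r + 1) ** 2,
                gen=lambda: random.randrange(10**6))
```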

Journal ArticleDOI
TL;DR: This work reviews online communities research and proposes a sequence for incorporating success conditions during initiation and development to increase their chances of becoming a successful community, one in which members participate actively and develop lasting relationships.
Abstract: Using the information systems lifecycle as a unifying framework, we review online communities research and propose a sequence for incorporating success conditions during initiation and development to increase their chances of becoming a successful community, one in which members participate actively and develop lasting relationships. Online communities evolve following distinctive lifecycle stages, and recommendations for success are more or less relevant depending on the developmental stage of the online community. In addition, the goal of the online community under study determines the components to include in the development of a successful online community. Online community builders and researchers will benefit from this review of the conditions that help online communities succeed.

Journal ArticleDOI
TL;DR: The previous research done to design, develop, and deploy systems for enabling private and anonymous communication on the Internet is surveyed, including mixes and mix networks, onion routing, and Dining Cryptographers networks.
Abstract: The past two decades have seen a growing interest in methods for anonymous communication on the Internet, both from the academic community and the general public. Several system designs have been proposed in the literature, of which a number have been implemented and are used by diverse groups, such as journalists, human rights workers, the military, and ordinary citizens, to protect their identities on the Internet. In this work, we survey the previous research done to design, develop, and deploy systems for enabling private and anonymous communication on the Internet. We identify and describe the major concepts and technologies in the field, including mixes and mix networks, onion routing, and Dining Cryptographers networks. We will also review powerful traffic analysis attacks that have motivated improvements and variations on many of these anonymity protocols made since their introduction. Finally, we will summarize some of the major open problems in anonymous communication research and discuss possible directions for future work in the field.
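
A toy of onion routing's layered encryption, using the Python cryptography package's Fernet as a stand-in for the real public-key handshake: the sender wraps the message once per relay, and each relay peels exactly one layer, so no single relay sees both endpoints together with the plaintext:

```python
# pip install cryptography
from cryptography.fernet import Fernet

relay_keys = [Fernet.generate_key() for _ in range(3)]  # entry, middle, exit

def build_onion(message: bytes, keys):
    """Encrypt for the exit relay first, then wrap outward toward the entry."""
    onion = message
    for key in reversed(keys):
        onion = Fernet(key).encrypt(onion)
    return onion

def route(onion, keys):
    """Each relay in turn strips exactly one layer of encryption."""
    for key in keys:
        onion = Fernet(key).decrypt(onion)
    return onion

packet = build_onion(b"meet at noon", relay_keys)
print(route(packet, relay_keys))   # b'meet at noon'
```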

Journal ArticleDOI
TL;DR: This survey gives an overview of formal results on the XML query language XPath: the expressiveness of XPath and its fragments compared to other formalisms for querying trees; algorithms and complexity bounds for the evaluation of XPath queries; and the static analysis of XPath queries.
Abstract: This survey gives an overview of formal results on the XML query language XPath. We identify several important fragments of XPath, focusing on subsets of XPath 1.0. We then give results on the expressiveness of XPath and its fragments compared to other formalisms for querying trees; algorithms and complexity bounds for the evaluation of XPath queries; and the static analysis of XPath queries.
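
For a hands-on feel of the language under study, Python's standard xml.etree.ElementTree evaluates a small fragment of XPath 1.0 (roughly the child/descendant/predicate core whose complexity such results analyze); the document is invented:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<library>"
    "  <book year='1999'><title>XML Basics</title></book>"
    "  <book year='2009'><title>XPath Formally</title></book>"
    "</library>")

# Descendant axis with an attribute predicate, then a child step:
for title in doc.findall(".//book[@year='2009']/title"):
    print(title.text)          # XPath Formally
```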

Journal ArticleDOI
TL;DR: This work provides a survey of Internet geolocation technologies with an emphasis on adversarial contexts, and considers how this technology performs against a knowledgeable adversary whose goal is to evade geolocation.
Abstract: Internet geolocation technology aims to determine the physical (geographic) location of Internet users and devices. It is currently proposed or in use for a wide variety of purposes, including targeted marketing, restricting digital content sales to authorized jurisdictions, and security applications such as reducing credit card fraud. This raises questions about the veracity of claims of accurate and reliable geolocation. We provide a survey of Internet geolocation technologies with an emphasis on adversarial contexts; that is, we consider how this technology performs against a knowledgeable adversary whose goal is to evade geolocation. We do so by examining first the limitations of existing techniques, and then, from this base, determining how best to evade existing geolocation techniques. We also consider two further geolocation techniques, which may be of use even against adversarial targets: (1) the extraction of client IP addresses using functionality introduced in the Java 1.5 API, and (2) the collection of round-trip times using HTTP refreshes. These techniques illustrate that the seemingly straightforward technique of evading geolocation by relaying traffic through a proxy server (or network of proxy servers) is not as straightforward as many end-users might expect. We give a demonstration of this for users of the popular Tor anonymizing network.
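
The delay signal such techniques rely on is easy to sample. A minimal Python sketch (not the paper's method; it times TCP handshakes rather than HTTP refreshes, and needs network access) illustrates the round-trip measurements that a relaying proxy inflates rather than hides:

```python
import socket
import time

def tcp_rtt(host, port=80, samples=3):
    """Estimate round-trip time by timing TCP handshakes, the kind of
    delay signal that delay-based geolocation relies on. The minimum
    over several samples filters out transient queueing delay."""
    best = float("inf")
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass
        best = min(best, time.perf_counter() - start)
    return best

print(f"RTT to example.com: {tcp_rtt('example.com') * 1000:.1f} ms")
```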

Journal ArticleDOI
TL;DR: This survey of automated and semiautomated computer systems for expressive performance of music examines the motivation for such systems and then reviews the majority of the systems developed over the last 25 years.
Abstract: We present a survey of research into automated and semiautomated computer systems for expressive performance of music. We will examine the motivation for such systems and then examine the majority of the systems developed over the last 25 years. To highlight some of the possible future directions for new research, the review uses primary terms of reference based on four elements: testing status, expressive representation, polyphonic ability, and performance creativity.

Journal ArticleDOI
Nathan Brown1
TL;DR: The emphasis is placed on describing the general methods that are routinely applied in molecular discovery and in a context that provides for an easily accessible article for computer scientists as well as scientists from other numerate disciplines.
Abstract: Chemoinformatics is an interface science aimed primarily at discovering novel chemical entities that will ultimately result in the development of novel treatments for unmet medical needs, although these same methods are also applied in other fields that ultimately design new molecules. The field combines expertise from, among others, chemistry, biology, physics, biochemistry, statistics, mathematics, and computer science. In this general review of chemoinformatics the emphasis is placed on describing the general methods that are routinely applied in molecular discovery and in a context that provides for an easily accessible article for computer scientists as well as scientists from other numerate disciplines.
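
One routinely applied method the review covers is fingerprint-based similarity search. A minimal Python sketch of the Tanimoto (Jaccard) coefficient over invented feature sets; real fingerprints come from toolkits such as RDKit:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity of two molecular fingerprints,
    here modeled as sets of structural feature IDs -- a workhorse
    similarity measure in molecular discovery."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Invented feature sets standing in for real structural fingerprints.
molecule_a = {1, 4, 7, 9, 12}
molecule_b = {1, 4, 7, 15, 20}
print(tanimoto(molecule_a, molecule_b))  # 0.428...: 3 shared of 7 total
```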

Journal ArticleDOI
TL;DR: The Verified Software Initiative will attempt to construct over the next fifteen years a comprehensive theory of programming that covers the features needed to build practical and reliable programs, a coherent toolset that automates the theory and scales up to the analysis of industrial-strength software.
Abstract: We propose an ambitious and long-term research program toward the construction of error-free software systems. Our manifesto represents a consensus position that has emerged from a series of national and international meetings, workshops, and conferences held from 2004 to 2007. The research project, the Verified Software Initiative, will attempt to construct over the next fifteen years: (1) a comprehensive theory of programming that covers the features needed to build practical and reliable programs, (2) a coherent toolset that automates the theory and scales up to the analysis of industrial-strength software, and (3) a collection of realistic verified programs that could replace unverified programs in current service and continue to evolve in a verified state. This document summarizes the background of the initiative, its scientific goals, and the principles that underlie a worldwide collaboration to achieve them. We include an assessment of its strengths, weaknesses, threats, and opportunities. A companion document will summarize a range of work packages, including developments in theory, tools, and experiments.

Journal ArticleDOI
TL;DR: This article presents a survey and analysis conducted in light of these challenging requirements and constraints, drawing on techniques and strategies from work in sensor fusion, sensor networks, smart sensing, Geographic Information Systems (GIS), photogrammetry, and other intelligent systems where finding optimal solutions to the placement and deployment of multimodal sensors covering a wide area is important.
Abstract: Although sensor planning in computer vision has been a subject of research for over two decades, the vast majority of the research seems to concentrate on two particular applications in a rather limited context of laboratory and industrial workbenches, namely 3D object reconstruction and robotic arm manipulation. Recently, interest has increasingly turned to solutions that provide wide-area autonomous surveillance systems for object characterization and situation awareness, involving portable, wireless, and/or Internet-connected radar, digital video, and/or infrared sensors. The prominent research problems associated with multisensor integration for wide-area surveillance are modality selection, sensor planning, data fusion, and data exchange (communication) among multiple sensors. Thus, the requirements and constraints to be addressed include far-field view, wide coverage, high resolution, cooperative sensors, adaptive sensing modalities, dynamic objects, and uncontrolled environments. This article summarizes a new survey and analysis conducted in light of these challenging requirements and constraints. It involves techniques and strategies from work done in the areas of sensor fusion, sensor networks, smart sensing, Geographic Information Systems (GIS), photogrammetry, and other intelligent systems where finding optimal solutions to the placement and deployment of multimodal sensors covering a wide area is important. While techniques covered in this survey are applicable to many wide-area environments such as traffic monitoring, airport terminal surveillance, parking lot surveillance, etc., our examples will be drawn mainly from such applications as harbor security and long-range face recognition.
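
For the placement subproblem, greedy set cover is a standard baseline: repeatedly choose the candidate site covering the most still-uncovered area. A minimal Python sketch with an invented harbor grid (the surveyed techniques are much richer, handling resolution, modality, and dynamics):

```python
def place_sensors(candidate_coverage, budget):
    """Greedy placement for wide-area coverage: repeatedly pick the
    candidate sensor site covering the most still-uncovered cells.
    Greedy set cover carries the classical (1 - 1/e) coverage guarantee."""
    uncovered = set().union(*candidate_coverage.values())
    chosen = []
    while uncovered and len(chosen) < budget:
        best = max(candidate_coverage,
                   key=lambda s: len(candidate_coverage[s] & uncovered))
        chosen.append(best)
        uncovered -= candidate_coverage[best]
    return chosen, uncovered

# Invented harbor grid: sensor site -> set of grid cells it can observe.
sites = {"pierA": {1, 2, 3}, "pierB": {3, 4, 5, 6}, "hill": {1, 5, 7, 8, 9}}
print(place_sensors(sites, budget=2))  # (['hill', 'pierB'], {2})
```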

Journal ArticleDOI
TL;DR: This article gives a self-contained, contemporary presentation of Schaefer's theorem on Boolean constraint satisfaction, the inaugural result of this area, as well as analogs of this theorem for quantified formulas.
Abstract: An emerging area of research studies the complexity of constraint satisfaction problems under restricted constraint languages. This article gives a self-contained, contemporary presentation of Schaefer's theorem on Boolean constraint satisfaction, the inaugural result of this area, as well as analogs of this theorem for quantified formulas. Our exposition makes use of and may serve as an introduction to logical and algebraic tools that have recently come into focus.
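
The algebraic tools the article introduces revolve around polymorphisms: Schaefer's theorem says CSP over a Boolean constraint language is tractable exactly when every relation is preserved by one of six operations (the two constants, AND, OR, majority, minority). A small Python sketch of the closure test, on the relation encoding the clause (x OR y):

```python
from itertools import product

def closed_under(relation, op):
    """Test whether a Boolean relation (a set of tuples) is preserved by
    applying `op` coordinate-wise to tuples of the relation -- the
    polymorphism test at the heart of the algebraic view of Schaefer's
    theorem."""
    arity = op.__code__.co_argcount
    return all(tuple(op(*cols) for cols in zip(*tups)) in relation
               for tups in product(relation, repeat=arity))

# R encodes the clause (x OR y): its set of satisfying assignments.
R = {(0, 1), (1, 0), (1, 1)}
AND = lambda a, b: a & b
OR  = lambda a, b: a | b
MAJ = lambda a, b, c: (a & b) | (a & c) | (b & c)
print(closed_under(R, AND))  # False: (0,1) AND (1,0) = (0,0), not in R
print(closed_under(R, OR))   # True: R is OR-closed (dual-Horn), tractable
print(closed_under(R, MAJ))  # True: 2-clauses are majority-closed (2-SAT)
```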

Journal ArticleDOI
Gary H. Sockut1, Balakrishna R. Iyer1
TL;DR: This article is a tutorial and survey on requirements, issues, and strategies for online reorganization; it analyzes the issues and then presents the strategies, which address those issues.
Abstract: In practice, any database management system sometimes needs reorganization, that is, a change in some aspect of the logical and/or physical arrangement of a database. In traditional practice, many types of reorganization have required denying access to a database (taking the database offline) during reorganization. Taking a database offline can be unacceptable for a highly available (24-hour) database, for example, a database serving electronic commerce or armed forces, or for a very large database. A solution is to reorganize online (concurrently with usage of the database, incrementally during users' activities, or interpretively). This article is a tutorial and survey on requirements, issues, and strategies for online reorganization. It analyzes the issues and then presents the strategies, which address those issues. The issues, most of which involve design trade-offs, include use of partitions, the locus of control for the process that reorganizes (a background process or users' activities), reorganization by copying to newly allocated storage (as opposed to reorganizing in place), use of differential files, references to data that has moved, performance, and activation of reorganization. The article surveys online strategies in three categories of reorganization. The first category, maintenance, involves restoring the physical arrangement of data instances without changing the database definition. This category includes restoration of clustering, reorganization of an index, rebalancing of parallel or distributed data, garbage collection for persistent storage, and cleaning (reclamation of space) in a log-structured file system. The second category involves changing the physical database definition; topics include construction of indexes, conversion between B+-trees and linear hash files, and redefinition (e.g., splitting) of partitions. The third category involves changing the logical database definition. Some examples are changing a column's data type, changing the inheritance hierarchy of object classes, and changing a relationship from one-to-many to many-to-many. The survey encompasses both research and commercial implementations, and this article points out several open research topics. As highly available or very large databases continue to become more common and more important in the world economy, the importance of online reorganization is likely to continue growing.
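
A minimal Python sketch of the copy-plus-differential-file strategy the survey analyzes: copy to fresh storage while users keep writing, capture concurrent writes in a differential log, then replay and switch over. Everything here (the dict standing in for a table, the write callback) is an invented stand-in:

```python
def reorganize_online(table, apply_writes):
    """Copy-based online reorganization: copy rows to fresh storage while
    users keep writing; capture concurrent writes in a differential log
    and replay them before the atomic switch-over."""
    diff_log = []
    # Phase 1: bulk copy (users' writes land in diff_log meanwhile).
    new_table = {k: v for k, v in sorted(table.items())}  # restore clustering
    apply_writes(diff_log)          # writes that arrived during the copy
    # Phase 2: short catch-up, then switch readers to the new copy.
    for key, value in diff_log:
        new_table[key] = value
    return new_table

db = {3: "c", 1: "a", 2: "b"}
concurrent = lambda log: log.extend([(4, "d"), (2, "B")])  # arrives mid-copy
print(reorganize_online(db, concurrent))  # {1: 'a', 2: 'B', 3: 'c', 4: 'd'}
```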

Journal ArticleDOI
TL;DR: Some of the basic deduction techniques used in software and hardware verification are introduced and the theoretical and engineering issues in building deductive verification tools are outlined.
Abstract: Automated deduction uses computation to perform symbolic logical reasoning. It has been a core technology for program verification from the very beginning. Satisfiability solvers for propositional and first-order logic significantly automate the task of deductive program verification. We introduce some of the basic deduction techniques used in software and hardware verification and outline the theoretical and engineering issues in building deductive verification tools. Beyond verification, deduction techniques can also be used to support a variety of applications including planning, program optimization, and program synthesis.
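
The satisfiability solvers mentioned descend from the DPLL procedure: unit propagation plus case-splitting. A minimal (and very unoptimized) Python version, using DIMACS-style signed integers for literals:

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL: unit propagation plus case-splitting.
    Clauses are lists of nonzero ints; -n means "not n" (DIMACS style).
    Returns a satisfying assignment {var: bool}, or None if unsatisfiable.
    """
    assignment = assignment or {}
    while True:
        if [] in clauses:
            return None                    # empty clause: conflict
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        # A one-literal clause forces that literal; simplify accordingly.
        assignment = {**assignment, abs(unit): unit > 0}
        clauses = [[l for l in c if l != -unit]
                   for c in clauses if unit not in c]
    if not clauses:
        return assignment                  # every clause satisfied
    lit = clauses[0][0]                    # branch on some open literal
    return (dpll(clauses + [[lit]], assignment)
            or dpll(clauses + [[-lit]], assignment))

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
print(dpll([[1, 2], [-1, 3], [-2, -3]]))  # {1: True, 3: True, 2: False}
```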

Journal ArticleDOI
TL;DR: This survey focuses on multiparty scenarios and provides a comprehensive overview of fundamental issues on nonrepudiation, including the types of nonrepudiation service and cryptographic evidence, the roles of trusted third parties, nonrepudiation phases and requirements, and the status of standardization.
Abstract: Nonrepudiation is a security service that plays an important role in many Internet applications. Traditional two-party nonrepudiation has been studied intensively in the literature. This survey focuses on multiparty scenarios and provides a comprehensive overview. It starts with a brief introduction of fundamental issues on nonrepudiation, including the types of nonrepudiation service and cryptographic evidence, the roles of trusted third parties, nonrepudiation phases and requirements, and the status of standardization. Then it describes the general multiparty nonrepudiation problem, and analyzes state-of-the-art mechanisms. After this, it presents in more detail the 1-N multiparty nonrepudiation solutions for distribution of different messages to multiple recipients. Finally, it discusses advanced solutions for two typical multiparty nonrepudiation applications, namely, multiparty certified email and multiparty contract signing.
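
The cryptographic evidence underlying nonrepudiation of origin is, at its simplest, a digital signature: only the private key could have produced it, so the signer cannot later deny the message. A minimal sketch with the Python cryptography package (the multiparty protocols surveyed add trusted third parties, fairness, and timeliness on top of this ingredient):

```python
# pip install cryptography
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

sender_key = Ed25519PrivateKey.generate()
message = b"I agree to the contract terms."
evidence = sender_key.sign(message)          # evidence of origin

public_key = sender_key.public_key()
try:
    public_key.verify(evidence, message)     # raises if forged or altered
    print("evidence verifies: origin cannot be repudiated")
except InvalidSignature:
    print("evidence invalid")
```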

Journal ArticleDOI
TL;DR: This article first describes some basic conceptual notions regarding the design of such fast algorithms, and then the coverage proceeds through several recursive graph classes, which include trees, series-parallel graphs, and treewidth-k graphs.
Abstract: Fast algorithms can be created for many graph problems when instances are confined to classes of graphs that are recursively constructed. This article first describes some basic conceptual notions regarding the design of such fast algorithms, and then the coverage proceeds through several recursive graph classes. Specific classes include trees, series-parallel graphs, k-terminal graphs, treewidth-k graphs, k-trees, partial k-trees, k-jackknife graphs, pathwidth-k graphs, bandwidth-k graphs, cutwidth-k graphs, branchwidth-k graphs, Halin graphs, cographs, cliquewidth-k graphs, k-NLC graphs, k-HB graphs, and rankwidth-k graphs. The definition of each class is provided. Typical algorithms are applied to solve problems on instances of most classes. Relationships between the classes are also discussed.
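
The design pattern behind these fast algorithms is dynamic programming over the recursive construction. A minimal Python example on the simplest class, trees: maximum independent set in linear time, keeping per node the best solution with the node included versus excluded (treewidth-k classes generalize the same idea to small separators):

```python
def max_independent_set(tree, root):
    """Linear-time DP on a rooted tree: for each node, track the best
    independent-set size with the node included vs. excluded.
    `tree` maps each node to its list of children."""
    def solve(v):
        included, excluded = 1, 0
        for child in tree.get(v, []):
            inc_c, exc_c = solve(child)
            included += exc_c              # if v is in, children must be out
            excluded += max(inc_c, exc_c)  # if v is out, children are free
        return included, excluded
    return max(solve(root))

#        a
#      / | \
#     b  c  d
#    /|     |
#   e f     g
T = {"a": ["b", "c", "d"], "b": ["e", "f"], "d": ["g"]}
print(max_independent_set(T, "a"))  # 4, e.g. {e, f, c, g}
```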

Journal ArticleDOI
TL;DR: Mature proof tools, both automatic and interactive, now provide indispensable aid in computing research, including research into verification; for mechanized proof of classical conjectures in mathematics, computers have become an indispensable tool.
Abstract: The origins of software verification go back to the pioneers of Computing Science, von Neumann and Turing. The idea has been rediscovered several times since then, for example by McCarthy, Naur and Floyd. The ideals of verification have inspired half a century of productive computing research at the foundations of the subject. There are now flourishing research schools in computational logic, computer-aided proof, programming theory, formal semantics, specification and programming languages, programming methodology and software engineering. By the end of the last century, enormous progress had been made in verification theory and in tools to assist in its application. The technology of proof was extended to include constraint solving and model checking, which were routinely exploited in the electronics industry to increase confidence in the absence of errors in circuit designs before commitment to silicon. Programming theory and semantics provided logics for proof of correctness of well-structured sequential programs. The foundations of concurrent programming were explored by employing temporal logic, and communication over channels was explored in a number of process algebras. Formal specifications were used in certain safety-critical applications as an aid to system development and verification of correctness. Internal program specifications in the form of program assertions were used in the software industry as test oracles, to detect and diagnose errors in regression tests conducted overnight. In suitable cases they are left in customer code for re-checking at run time. The early years of the current century have seen a dramatic spurt in progress towards realization of the ideal of verification of software as well as hardware. Proof technology is now routinely exploited in industrially supported program analysis tools, which successfully detect many kinds of generic program error even before a program is tested. Mature proof tools, both automatic and interactive, are now providing indispensable aid in computing research, including research into verification. For mechanized proof of classical conjectures in mathematics, computers have become an indispensable tool.

Journal ArticleDOI
TL;DR: The article “Temporal Logics for Real-Time System Specification” surveys some of the relevant literature dealing with the use of temporal logics for the specification of real-time systems, but introduces some imprecisions that might create some confusion in the reader.
Abstract: The article “Temporal Logics for Real-Time System Specification” surveys some of the relevant literature dealing with the use of temporal logics for the specification of real-time systems. Unfortunately, it introduces some imprecisions that might create some confusion in the reader. While a certain degree of informality is certainly useful when addressing a broad audience, imprecisions can negatively impact the legibility of the exposition. We clarify some of its remarks on a few topics, in an effort to contribute to the usefulness of the survey for the reader.