Author

Jun Yang

Bio: Jun Yang is an academic researcher from Duke University. The author has contributed to research in topics including Tuple and Wireless sensor network. The author has an h-index of 37 and has co-authored 167 publications receiving 5,195 citations. Previous affiliations of Jun Yang include University of California, Berkeley and Durham University.


Papers
Proceedings Article
11 Jun 2007
TL;DR: BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search, and offers orders-of-magnitude performance improvement over existing approaches.
Abstract: Query processing over graph-structured data is enjoying a growing number of applications. A top-k keyword search query on a graph finds the top k answers according to some ranking criteria, where each answer is a substructure of the graph containing all query keywords. Current techniques for supporting such queries on general graphs suffer from several drawbacks, e.g., poor worst-case performance, not taking full advantage of indexes, and high memory requirements. To address these problems, we propose BLINKS, a bi-level indexing and query processing scheme for top-k keyword search on graphs. BLINKS follows a search strategy with provable performance bounds, while additionally exploiting a bi-level index for pruning and accelerating the search. To reduce the index space, BLINKS partitions a data graph into blocks: The bi-level index stores summary information at the block level to initiate and guide search among blocks, and more detailed information for each block to accelerate search within blocks. Our experiments show that BLINKS offers orders-of-magnitude performance improvement over existing approaches.
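
To make the bi-level idea concrete, here is a minimal sketch of a two-level inverted index: a block-level index mapping each keyword to the blocks that contain it, and a per-block index mapping the keyword to matching nodes. The partitioning, the pruning rule, and all names are illustrative simplifications, not BLINKS' actual data structures or search algorithm.

```python
# Toy two-level keyword index (hypothetical simplification of BLINKS'
# bi-level design, not the paper's actual structures or algorithm).
from collections import defaultdict

# A small data graph: node -> keywords, partitioned into blocks.
node_keywords = {
    "a": {"db"}, "b": {"query"}, "c": {"db", "graph"},
    "d": {"graph"}, "e": {"query"},
}
blocks = {0: {"a", "b"}, 1: {"c", "d", "e"}}  # node partition

# Block-level index: keyword -> blocks containing it (initiates and
# guides search among blocks).
block_index = defaultdict(set)
# Intra-block index: (block, keyword) -> nodes (accelerates search
# within a block).
intra_index = defaultdict(set)
for blk, nodes in blocks.items():
    for n in nodes:
        for kw in node_keywords[n]:
            block_index[kw].add(blk)
            intra_index[(blk, kw)].add(n)

def candidate_blocks(query):
    """Prune: only blocks containing every keyword can host a purely
    local answer (cross-block answers need inter-block expansion,
    omitted here)."""
    return set.intersection(*(block_index[kw] for kw in query))

query = ["db", "graph"]
for blk in sorted(candidate_blocks(query)):
    hits = {kw: sorted(intra_index[(blk, kw)]) for kw in query}
    print(f"block {blk}: {hits}")
```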

601 citations

Proceedings Article
25 Aug 1997
TL;DR: This work presents the design and implementation of a query optimizer for Garlic, a middleware system designed to integrate data from a broad range of data sources with very different query capabilities, and illustrates its actions through an example.
Abstract: Businesses today need to interrelate data stored in diverse systems with differing capabilities, ideally via a single high-level query interface. We present the design of a query optimizer for Garlic [C 95], a middleware system designed to integrate data from a broad range of data sources with very different query capabilities. Garlic’s optimizer extends the rule-based approach of [Loh88] to work in a heterogeneous environment, by defining generic rules for the middleware and using wrapper-provided rules to encapsulate the capabilities of each data source. This approach offers great advantages in terms of plan quality, extensibility to new sources, incremental implementation of rules for new sources, and the ability to express the capabilities of a diverse set of sources. We describe the design and implementation of this optimizer, and illustrate its actions through an example.
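
The split between generic middleware rules and wrapper-provided capability rules might look like the following sketch; the Wrapper class, its capability flags, and the plan strings are invented for illustration and are not Garlic's actual interfaces.

```python
# Hypothetical sketch of wrapper-encapsulated capabilities in a
# Garlic-style middleware optimizer (all names invented).

class Wrapper:
    """Each data source advertises what it can evaluate natively."""
    def __init__(self, name, can_filter, can_join):
        self.name, self.can_filter, self.can_join = name, can_filter, can_join

def plan_scan(wrapper, predicate):
    """Generic middleware rule: push a filter down when the wrapper's
    capability rules say the source can evaluate it, else filter in
    the middleware on top of a plain scan."""
    if wrapper.can_filter:
        return f"PushedScan({wrapper.name}, {predicate})"
    return f"Filter({predicate}, Scan({wrapper.name}))"

def plan_join(left, right, w_left, w_right):
    """Join at a capable source if both inputs live there; otherwise
    the middleware performs the join over the two sub-plans."""
    if w_left is w_right and w_left.can_join:
        return f"PushedJoin({w_left.name})"
    return f"MiddlewareJoin({left}, {right})"

rdbms = Wrapper("rdbms", can_filter=True, can_join=True)
files = Wrapper("files", can_filter=False, can_join=False)
lhs = plan_scan(rdbms, "price > 10")
rhs = plan_scan(files, "year = 1997")
print(plan_join(lhs, rhs, rdbms, files))
```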

537 citations

Proceedings Article
03 Apr 2006
TL;DR: This paper proposes a novel labeling scheme for sparse graphs that ensures that graph reachability queries can be answered in constant time, and provides an alternative scheme to tradeoff query time for label space, which further benefits applications that use tree-like graphs.
Abstract: Graph reachability is fundamental to a wide range of applications, including XML indexing, geographic navigation, Internet routing, ontology queries based on RDF/OWL, etc. Many applications involve huge graphs and require fast answering of reachability queries. Several reachability labeling methods have been proposed for this purpose. They assign labels to the vertices, such that the reachability between any two vertices may be decided using their labels only. For sparse graphs, 2-hop based reachability labeling schemes answer reachability queries efficiently using relatively small label space. However, the labeling process itself is often too time-consuming to be practical for large graphs. In this paper, we propose a novel labeling scheme for sparse graphs. Our scheme ensures that graph reachability queries can be answered in constant time. Furthermore, for sparse graphs, the complexity of the labeling process is almost linear, which makes our algorithm applicable to massive datasets. Analytical and experimental results show that our approach is much more efficient than state-of-the-art approaches. Furthermore, our labeling method also provides an alternative scheme to trade off query time for label space, which further benefits applications that use tree-like graphs.
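
For the special case of a tree, classic DFS interval labels already give constant-time, label-only reachability checks, which conveys the flavor of the approach; the paper's contribution is a scheme for general sparse graphs, which this sketch does not attempt.

```python
# Interval labeling on a tree: reachable(u, v) is decided from labels
# alone in O(1). This only illustrates label-based reachability; the
# paper's scheme handles general sparse graphs.

def label_tree(root, children):
    """Assign [start, end) DFS intervals; v is a descendant of u iff
    u.start <= v.start < u.end."""
    labels, counter = {}, [0]
    def dfs(u):
        start = counter[0]; counter[0] += 1
        for c in children.get(u, []):
            dfs(c)
        labels[u] = (start, counter[0])
    dfs(root)
    return labels

children = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
labels = label_tree("r", children)

def reachable(u, v):
    (us, ue), (vs, _) = labels[u], labels[v]
    return us <= vs < ue  # constant time, labels only

print(reachable("a", "d"), reachable("b", "c"))  # True False
```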

258 citations

Proceedings Article
27 Jun 2006
TL;DR: This work adds enhancements to CONCH to build in redundant constraints and provide a method to interpret the resulting reports in case of uncertainty, and experimentally evaluates CONCH's effectiveness against competing schemes in a number of interesting scenarios.
Abstract: Wireless sensor networks have created new opportunities for data collection in a variety of scenarios, such as environmental and industrial monitoring, where we expect data to be temporally and spatially correlated. Researchers may want to continuously collect all sensor data from the network for later analysis. Suppression, both temporal and spatial, provides opportunities for reducing the energy cost of sensor data collection, and we demonstrate how both types can be combined for maximal benefit. We frame the problem as one of monitoring node and edge constraints. A monitored node triggers a report if its value changes. A monitored edge triggers a report if the difference between its nodes' values changes. The set of reports collected at the base station is used to derive all node values. We fully exploit the potential of this global inference in our algorithm, CONCH, short for constraint chaining. Constraint chaining builds a network of constraints that are maintained locally, but allow a global view of values to be maintained with minimal cost. Network failure complicates the use of suppression, since failure and suppression both result in an absence of reports and cannot be told apart by silence alone. We add enhancements to CONCH that build in redundant constraints, and provide a method to interpret the resulting reports in case of uncertainty. Using simulation, we experimentally evaluate CONCH's effectiveness against competing schemes in a number of interesting scenarios.
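
A toy version of constraint chaining, assuming a spanning set of edge constraints chosen by hand: silence means "unchanged", and the base station chains the monitored node's value across edge differences to rederive every node. The data layout below is a hypothetical simplification, not CONCH's actual protocol.

```python
# Toy constraint chaining (hypothetical simplification of CONCH):
# monitor one node plus a spanning set of edges; suppressed reports
# mean "unchanged", and the base station chains known values across
# edge constraints to derive the rest.

monitored_node = "a"
monitored_edges = [("a", "b"), ("b", "c")]  # spanning tree of constraints

# Base station's last-known state.
known_value = {"a": 10.0}
known_diff = {("a", "b"): 2.0, ("b", "c"): -1.0}  # value[v] - value[u]

def apply_reports(node_reports, edge_reports):
    """Reports carry only *changed* constraints; silence = no change."""
    known_value.update(node_reports)
    known_diff.update(edge_reports)

def derive_all():
    """Global inference: chain values along the edge constraints."""
    values = dict(known_value)
    for (u, v), d in known_diff.items():
        values[v] = values[u] + d
    return values

print(derive_all())                   # {'a': 10.0, 'b': 12.0, 'c': 11.0}
apply_reports({}, {("a", "b"): 3.0})  # only edge (a,b)'s difference changed
print(derive_all())                   # {'a': 10.0, 'b': 13.0, 'c': 12.0}
```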

175 citations

Book
16 Nov 2012
TL;DR: This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.
Abstract: Materialized views are a natural embodiment of the ideas of precomputation and caching in databases. Instead of computing a query from scratch, a system can use results that have already been computed, stored, and kept in sync with database updates. The ability of materialized views to speed up queries benefits most database applications, ranging from traditional querying and reporting to web database caching, online analytical processing, and data mining. By reducing dependency on the availability of base data, materialized views have also laid much of the foundation for information integration and data warehousing systems. The database tradition of declarative querying distinguishes materialized views from generic applications of precomputation and caching in other contexts, and makes materialized views especially interesting, powerful, and challenging at the same time. Study of materialized views has generated a rich research literature and mature commercial implementations, aimed at providing efficient, effective, automated, and general solutions to the selection, use, and maintenance of materialized views. This monograph provides an accessible introduction and reference to materialized views, explains their core ideas, highlights their recent developments, and points out their sometimes subtle connections to other research topics in databases.
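
The precomputation-and-maintenance idea can be shown with a tiny incrementally maintained aggregate; this dict-based sketch stands in for a real view manager and query rewriter, and the table and view names are illustrative.

```python
# Minimal sketch of a materialized aggregate view: answer queries from
# precomputed state and keep it in sync with base-table updates via
# deltas, instead of recomputing from scratch.

from collections import defaultdict

sales = []                           # base table: (region, amount)
total_by_region = defaultdict(float) # view: SUM(amount) GROUP BY region

def insert_sale(region, amount):
    sales.append((region, amount))
    total_by_region[region] += amount  # incremental view maintenance

def delete_sale(region, amount):
    sales.remove((region, amount))
    total_by_region[region] -= amount  # apply the delta, no recomputation

insert_sale("east", 100.0)
insert_sale("east", 50.0)
insert_sale("west", 70.0)
print(total_by_region["east"])  # 150.0, answered from the view
delete_sale("east", 50.0)
print(total_by_region["east"])  # 100.0, kept in sync by the delta
```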

172 citations


Cited by
Proceedings Article
03 Jun 2002
TL;DR: This paper motivates the need for, and the research issues arising from, a new model of data processing in which data does not take the form of persistent relations but rather arrives in multiple, continuous, rapid, time-varying data streams.
Abstract: In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
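
A continuous query such as a sliding-window average illustrates how this model departs from one-shot queries over persistent relations; the count-based window semantics below are an assumption for the sketch.

```python
# A continuous query over an unbounded stream: tuples arrive one at a
# time and the answer is maintained incrementally over a sliding
# window (a simple count-based window is assumed here).

from collections import deque

def windowed_avg(stream, window=3):
    """Emit the average of the last `window` items after each arrival."""
    buf, total = deque(), 0.0
    for x in stream:
        buf.append(x); total += x
        if len(buf) > window:
            total -= buf.popleft()   # expire the oldest tuple
        yield total / len(buf)

for avg in windowed_avg([10, 20, 60, 20, 0]):
    print(round(avg, 2))   # 10.0 15.0 30.0 33.33 26.67
```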

2,933 citations

Journal Article
TL;DR: This survey presents a comprehensive review of detecting fake news on social media, including fake news characterizations based on psychology and social theories, existing algorithms from a data mining perspective, and evaluation metrics and representative datasets.
Abstract: Social media for news consumption is a double-edged sword. On the one hand, its low cost, easy access, and rapid dissemination of information lead people to seek out and consume news from social media. On the other hand, it enables the wide spread of "fake news", i.e., low-quality news with intentionally false information. The extensive spread of fake news has the potential for extremely negative impacts on individuals and society. Therefore, fake news detection on social media has recently become an emerging research topic that is attracting tremendous attention. Fake news detection on social media presents unique characteristics and challenges that make existing detection algorithms from traditional news media ineffective or not applicable. First, fake news is intentionally written to mislead readers to believe false information, which makes it difficult and nontrivial to detect based on news content; therefore, we need to include auxiliary information, such as user social engagements on social media, to help make a determination. Second, exploiting this auxiliary information is challenging in and of itself as users' social engagements with fake news produce data that is big, incomplete, unstructured, and noisy. Because the issue of fake news detection on social media is both challenging and relevant, we conducted this survey to further facilitate research on the problem. In this survey, we present a comprehensive review of detecting fake news on social media, including fake news characterizations on psychology and social theories, existing algorithms from a data mining perspective, evaluation metrics and representative datasets. We also discuss related research areas, open problems, and future research directions for fake news detection on social media.

1,891 citations

Book
02 Jan 1991

1,377 citations

Proceedings Article
01 Jan 2003
TL;DR: The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams and leverages the PostgreSQL open source code base.
Abstract: Increasingly pervasive networks are leading towards a world where data is constantly in motion. In such a world, conventional techniques for query processing, which were developed under the assumption of a far more static and predictable computational environment, will not be sufficient. Instead, query processors based on adaptive dataflow will be necessary. The Telegraph project has developed a suite of novel technologies for continuously adaptive query processing. The next generation Telegraph system, called TelegraphCQ, is focused on meeting the challenges that arise in handling large streams of continuous queries over high-volume, highly-variable data streams. In this paper, we describe the system architecture and its underlying technology, and report on our ongoing implementation effort, which leverages the PostgreSQL open source code base. We also discuss open issues and our research agenda.
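
The inversion at the heart of such systems, many standing queries evaluated against each arriving tuple rather than one query run against stored tables, might be sketched as follows; the StreamEngine class and its registration API are invented for illustration and say nothing about TelegraphCQ's actual interfaces or its adaptive dataflow machinery.

```python
# Sketch of continuous queries sharing one dataflow: each arriving
# tuple is routed through every registered standing query. The API is
# invented for illustration, not TelegraphCQ's actual interface.

class StreamEngine:
    def __init__(self):
        self.queries = []   # (name, predicate, on_match) triples

    def register(self, name, predicate, on_match):
        """Install a standing query that lives across many tuples."""
        self.queries.append((name, predicate, on_match))

    def push(self, tup):
        """One pass over the shared queries per arriving tuple."""
        for name, predicate, on_match in self.queries:
            if predicate(tup):
                on_match(name, tup)

engine = StreamEngine()
engine.register("hot", lambda t: t["temp"] > 30,
                lambda q, t: print(q, "->", t))
engine.register("sensor7", lambda t: t["id"] == 7,
                lambda q, t: print(q, "->", t))

for tup in [{"id": 3, "temp": 35}, {"id": 7, "temp": 20}]:
    engine.push(tup)   # tuples flow through queries, not queries over tables
```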

1,248 citations

Journal Article
TL;DR: The paper presents the “textbook” architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems, and discusses different kinds of distributed systems such as client-server, middleware (multitier), and heterogeneous database systems and shows how query processing works in these systems.
Abstract: Distributed data processing is becoming a reality. Businesses want to do it for many reasons, and they often must do it in order to stay competitive. While much of the infrastructure for distributed data processing is already there (e.g., modern network technology), a number of issues make distributed data processing still a complex undertaking: (1) distributed systems can become very large, involving thousands of heterogeneous sites including PCs and mainframe server machines; (2) the state of a distributed system changes rapidly because the load of sites varies over time and new sites are added to the system; (3) legacy systems need to be integrated; such legacy systems usually have not been designed for distributed data processing and now need to interact with other (modern) systems in a distributed environment. This paper presents the state of the art of query processing for distributed database and information systems. The paper presents the “textbook” architecture for distributed query processing and a series of techniques that are particularly useful for distributed database systems. These techniques include special join techniques, techniques to exploit intraquery parallelism, techniques to reduce communication costs, and techniques to exploit caching and replication of data. Furthermore, the paper discusses different kinds of distributed systems such as client-server, middleware (multitier), and heterogeneous database systems, and shows how query processing works in these systems.
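
One classic communication-reducing join technique in the distributed query processing literature is the semijoin: ship only the join-column values to the remote site, reduce the remote table there, and ship back only the matching tuples. The sites and tables below are illustrative, not taken from the paper.

```python
# Semijoin sketch: reduce communication in a distributed join by
# shipping join keys first, then only the matching remote tuples
# (site and table names are illustrative).

site1_orders = [(1, "ok"), (2, "late"), (4, "ok")]     # (cust_id, status)
site2_customers = [(1, "Ada"), (2, "Bob"), (3, "Cy")]  # (cust_id, name)

# Step 1: site 1 ships just the join keys (small) to site 2.
shipped_keys = {cid for cid, _ in site1_orders}

# Step 2: site 2 reduces its table to matching tuples and ships those back.
reduced = [(cid, name) for cid, name in site2_customers
           if cid in shipped_keys]

# Step 3: site 1 completes the join locally.
names = dict(reduced)
result = [(cid, status, names[cid]) for cid, status in site1_orders
          if cid in names]
print(result)   # [(1, 'ok', 'Ada'), (2, 'late', 'Bob')]
```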

980 citations