Towards a query optimizer for text-centric tasks
TLDR
This article presents fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way, and adapts results from random-graph theory and statistics to develop a rigorous cost model for the execution plans.
Abstract:
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
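The trade-off described in the abstract, execution time against output completeness, can be illustrated with a minimal sketch. This is not the article's cost model: the `PlanEstimate` type, the `choose_plan` function, and all numbers are hypothetical, and the selection rule (fastest plan whose estimated recall meets a target) is one simple assumption about how time and recall estimates might be combined.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical cost estimate for one execution plan."""
    name: str
    est_time: float    # estimated execution time, arbitrary units
    est_recall: float  # estimated fraction of the target output, in [0, 1]


def choose_plan(plans: Sequence[PlanEstimate],
                target_recall: float) -> Optional[PlanEstimate]:
    """Return the fastest plan whose estimated recall meets the target.

    Returns None when no plan is estimated to reach the target.
    """
    feasible = [p for p in plans if p.est_recall >= target_recall]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p.est_time)


# Made-up estimates: a full crawl is slow but complete, while a
# query-based plan is fast but reaches only part of the output.
plans = [
    PlanEstimate("crawl", est_time=1000.0, est_recall=1.0),
    PlanEstimate("query", est_time=120.0, est_recall=0.6),
]

low = choose_plan(plans, target_recall=0.5)   # modest completeness target
high = choose_plan(plans, target_recall=0.9)  # demanding completeness target
print(low.name, high.name)
```

With these invented estimates, a modest recall target favors the query-based plan and a demanding one favors the crawl, which is the kind of choice the abstract argues should be made from estimates rather than intuition.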
Citations
Journal ArticleDOI
Information Extraction
TL;DR: A taxonomy of the field is created along various dimensions derived from the nature of the extraction task, the techniques used for extraction, the variety of input resources exploited, and the type of output produced to survey techniques for optimizing the various steps in an information extraction pipeline.
Word Frequency Distributions
TL;DR: This paper presents a meta-modelling framework for estimating the randomness of word frequency distributions using a variety of non-parametric and parametric models.
Journal Article
ACM Transactions on Database Systems
Dan Suciu, Gerhard Weikum +1 more
Proceedings ArticleDOI
From information to knowledge: harvesting entities and relationships from web sources
Gerhard Weikum, Martin Theobald +1 more
TL;DR: This tutorial discusses state-of-the-art methods, research opportunities, and open challenges along this avenue of knowledge harvesting, to automatically construct and maintain a comprehensive knowledge base of facts about named entities, their semantic classes, and their mutual relations as well as temporal contexts, with high precision and high recall.
Journal ArticleDOI
The YAGO-NAGA approach to knowledge discovery
TL;DR: The architecture of the YAGO extractor toolkit, its distinctive approach to consistency checking, its provisions for maintenance and further growth, and the query engine for YAGO, coined NAGA, are presented.
References
Book
The Nature of Statistical Learning Theory
TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Statistical learning theory
TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.
Journal ArticleDOI
Pattern Classification and Scene Analysis.
Book
Pattern classification and scene analysis
Richard O. Duda, Peter E. Hart +1 more
TL;DR: In this article, a unified, comprehensive and up-to-date treatment of both statistical and descriptive methods for pattern recognition is provided, including Bayesian decision theory, supervised and unsupervised learning, nonparametric techniques, discriminant analysis, clustering, preprocessing of pictorial data, spatial filtering, shape description techniques, perspective transformations, projective invariants, linguistic procedures, and artificial intelligence techniques for scene analysis.
Related Papers (5)
Snowball: extracting relations from large plain-text collections
Eugene Agichtein, Luis Gravano +1 more
UIMA: an architectural approach to unstructured information processing in the corporate research environment
David A. Ferrucci, Adam Lally +1 more