
Showing papers by "Christian Schallhart published in 2012"


Proceedings ArticleDOI
16 Apr 2012
TL;DR: A first prototype of DIADEM demonstrates that, in contrast to the alchemists, DIADEM has found a viable formula for transforming unstructured web information into highly structured data with near-perfect accuracy.
Abstract: Search engines are the sinews of the web. These sinews have become strained, however: where the web's function was once a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics: a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers, never sure a better offer isn't just around the corner. What search engines lack is an understanding of the objects and their attributes published on websites. Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near-perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy, provided we equip it with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website in this domain. In this demonstration, we show with a first prototype that, in contrast to the alchemists, DIADEM has developed a viable formula.
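
To make "ontology and phenomenology of the domain" concrete, here is a minimal Python sketch of the kind of knowledge such a system consumes; the entity, attributes, and textual cues below are invented for illustration and are not DIADEM's actual format.

    # Hypothetical sketch of domain knowledge: entities, attributes, and the
    # textual cues that signal them on a page (illustrative names only).
    USED_CAR_DOMAIN = {
        "entity": "UsedCar",
        "attributes": {
            "price":   {"type": "currency", "cues": ["price", "£", "$"]},
            "mileage": {"type": "integer",  "cues": ["miles", "mileage", "km"]},
            "model":   {"type": "string",   "cues": ["model", "make"]},
        },
    }

    def guess_attributes(text, domain=USED_CAR_DOMAIN):
        """Return the attributes whose textual cues occur in `text`."""
        text = text.lower()
        return [name for name, spec in domain["attributes"].items()
                if any(cue in text for cue in spec["cues"])]

    print(guess_attributes("Price: £3,995 - 42,000 miles"))  # ['price', 'mileage']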

46 citations


Book ChapterDOI
11 Nov 2012
TL;DR: This work introduces deqa, a conceptual framework that combines state-of-the-art semantic technologies with effective data extraction, and applies it to the UK real estate domain, where it answers a significant percentage of such object-search questions correctly.
Abstract: Despite decades of effort, intelligent object search remains elusive. Neither search engine nor semantic web technologies alone have managed to provide usable systems for simple questions such as "find me a flat with a garden and more than two bedrooms near a supermarket." We introduce deqa, a conceptual framework that achieves this elusive goal by combining state-of-the-art semantic technologies with effective data extraction. To that end, we apply deqa to the UK real estate domain and show that it can answer a significant percentage of such questions correctly. deqa achieves this by mapping natural language questions to SPARQL patterns. These patterns are then evaluated on an RDF database of current real estate offers. The offers are obtained with OXPath, a state-of-the-art data extraction system, from the major agencies in the Oxford area and linked through LIMES to background knowledge such as the locations of supermarkets.
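
For a flavour of the final step, the sketch below evaluates a SPARQL pattern of the described kind over a toy RDF graph of extracted offers, using the rdflib library; the ex: vocabulary is invented for this example.

    # Toy version of the deqa pipeline's last stage: evaluating a SPARQL
    # pattern over an RDF graph of offers (vocabulary invented).
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.flat1, RDF.type, EX.Flat))
    g.add((EX.flat1, EX.bedrooms, Literal(3)))
    g.add((EX.flat1, EX.nearSupermarket, Literal(True)))

    # A question like "flat with more than two bedrooms near a supermarket"
    # might map to a pattern along these lines:
    q = """
    PREFIX ex: <http://example.org/>
    SELECT ?offer WHERE {
      ?offer a ex:Flat ; ex:bedrooms ?b ; ex:nearSupermarket true .
      FILTER (?b > 2)
    }
    """
    for row in g.query(q):
        print(row.offer)  # http://example.org/flat1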

44 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: This paper presents OPAL, the first comprehensive approach to form understanding; it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns.
Abstract: Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving the usability and accessibility of the web. Form understanding has received surprisingly little attention other than as a component in specific applications such as crawlers. No comprehensive approach to form understanding exists, and previous works disagree even on the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL pushes the state of the art: for form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near-perfect accuracy (> 98%).
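
The toy sketch below illustrates the core labeling idea of combining textual, structural, and visual signals into one score per (label, field) pair; the features and weights are invented and not OPAL's actual model.

    # Combine textual, structural, and visual evidence into a single
    # labeling score (all weights illustrative).
    def label_score(label, field):
        score = 0.0
        if label.get("for_attr") == field.get("id"):       # structural: <label for=...>
            score += 1.0
        if label["text"].lower() in field.get("name", ""): # textual: name echoes label
            score += 0.5
        dist = abs(label["x"] - field["x"]) + abs(label["y"] - field["y"])
        score += max(0.0, 0.5 - 0.01 * dist)               # visual: rendered proximity
        return score

    label = {"text": "City", "for_attr": "city", "x": 10, "y": 40}
    field = {"id": "city", "name": "city", "x": 90, "y": 42}
    print(label_score(label, field))  # 1.5 -> strong label-field match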

29 citations


Posted Content
TL;DR: In this paper, a Datalog-based template language eases the specification of a domain schema of forms, which OPAL then uses for form understanding and integration.
Abstract: Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding provides applications ranging from crawlers and meta-search engines to service integrators with a key to this content. Yet, it has received little attention other than as a component in specific applications such as crawlers or meta-search engines. No comprehensive approach to form understanding exists, let alone one that produces rich models for semantic services or integration with linked open data. In this paper, we present OPAL, the first comprehensive approach to form understanding and integration. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems, OPAL pushes the state of the art: for form labeling, it combines features from the text, structure, and visual rendering of a web page. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms, OPAL outperforms previous approaches for form labeling by a significant margin. For form interpretation, OPAL uses a schema (or ontology) of forms in a given domain. Thanks to this domain schema, it is able to produce nearly perfect form interpretations (more than 97 percent accuracy in the evaluation domains). Yet, the effort to produce a domain schema is low, as we provide a Datalog-based template language that eases the specification of such schemata, together with a methodology for deriving a domain schema largely automatically from an existing domain ontology. We demonstrate the value of OPAL's form interpretations through a light-weight form integration system that successfully translates and distributes master queries to hundreds of forms without error, yet is implemented with only a handful of translation rules.
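
The schema rules can be pictured as in the following toy, Datalog-inspired rule set in Python, which classifies fields from lower-level evidence; the predicates and rule encoding are invented for illustration and are not OPAL's concrete template language.

    # Datalog-flavoured classification rules: the head holds if every body
    # condition holds for the field (encoding invented for this sketch).
    RULES = [
        ("minPriceField", [("label_contains", "price"), ("label_contains", "min")]),
        ("maxPriceField", [("label_contains", "price"), ("label_contains", "max")]),
    ]

    def holds(cond, field):
        kind, arg = cond
        return kind == "label_contains" and arg in field["label"].lower()

    def classify(field):
        return [head for head, body in RULES
                if all(holds(c, field) for c in body)]

    print(classify({"label": "Min Price"}))  # ['minPriceField']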

20 citations


Book ChapterDOI
23 Jul 2012
TL;DR: A novel framework for web block classification, BERyL, combines rule-based reasoning for feature extraction with machine learning for feature selection and classification; it is applicable in a wide range of settings and can be adjusted to maximise either precision, recall, or speed.
Abstract: Content-intensive web sites, such as Google or Amazon, paginate their results to accommodate limited screen sizes. Thus, human users and automatic tools alike have to traverse the pagination links when they crawl the site, extract data, or automate common tasks, where these applications require access to the entire result set. Previous approaches, as well as existing crawlers and automation tools, rely on simple heuristics (e.g., considering only the link text), falling back to an exhaustive exploration of the site where those heuristics fail. In particular, focused crawlers and data extraction systems target only fractions of the individual pages of a given site, rendering a highly accurate identification of pagination links essential to avoid the exhaustive exploration of irrelevant pages. We identify pagination links in a wide range of domains and sites with near-perfect accuracy (99%). We obtain these results with BERyL, a novel framework for web block classification that combines rule-based reasoning for feature extraction and machine learning for feature selection and classification. Through this combination, BERyL is applicable in a wide range of settings, adjustable to maximise either precision, recall, or speed. We illustrate how BERyL minimises the effort for feature extraction and evaluate the impact of a broad range of features (content, structural, and visual).
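
A rough sketch of the BERyL recipe follows: hand-written rules produce the features, and an off-the-shelf learner (scikit-learn here) performs the classification; the features and training data are toy stand-ins.

    # Rules extract features; a learner classifies pagination links.
    from sklearn.tree import DecisionTreeClassifier

    def features(link):
        text = link["text"].strip().lower()
        return [int(text in {"next", ">", ">>"}),  # content rule
                int(text.isdigit()),               # page-number rule
                int(link["in_nav_block"]),         # structural rule
                link["font_size"]]                 # visual feature

    train = [({"text": "Next", "in_nav_block": True,  "font_size": 11}, 1),
             ({"text": "3",    "in_nav_block": True,  "font_size": 11}, 1),
             ({"text": "Home", "in_nav_block": False, "font_size": 14}, 0)]
    clf = DecisionTreeClassifier().fit([features(l) for l, _ in train],
                                       [y for _, y in train])
    new_link = {"text": ">>", "in_nav_block": True, "font_size": 11}
    print(clf.predict([features(new_link)]))  # [1] -> pagination link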

14 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement, offering a list of wrappers ranked by example similarity and robustness.
Abstract: Good examples are hard to find, particularly in wrapper induction: picking even one wrong example can spell disaster by yielding overgeneralized or overspecialized wrappers. Such wrappers extract data with low precision or recall, unless adjusted by human experts at significant cost. Visual OXPath is an open-source, visual wrapper induction system that requires minimal examples and eases wrapper refinement: often it derives the intended wrapper from a single example through sophisticated heuristics that determine the best set of similar examples. To ease wrapper refinement, it offers a list of wrappers ranked by example similarity and robustness. Visual OXPath offers extensive visual feedback for this refinement, which can be performed without any knowledge of the underlying wrapper language. Where further refinement by a human expert is needed, Visual OXPath profits from being based on OXPath, a declarative wrapper language that extends XPath with a thin layer of features necessary for extraction and page navigation.
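
For a flavour of the underlying language, here is an OXPath-style expression held in a Python string; it is simplified from published OXPath examples and is not guaranteed to run verbatim against any real site.

    # OXPath extends XPath with actions ({...}) and extraction markers (:<...>).
    wrapper = (
        'doc("http://example.com/search")'
        '//input[@name="q"]/{"oxford flats"}'            # fill the search box
        '//button[@type="submit"]/{click /}'             # submit, continue on results
        '//div[@class="offer"]:<offer>'                  # one <offer> record per div
        '[.//span[@class="price"]:<price=string(.)>]'    # nested price attribute
    )
    print(wrapper)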

12 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: A novel form understanding approach, OPAL, assists in form filling even for complex, previously unknown forms, achieving >99% accuracy in form understanding in the UK real estate domain.
Abstract: Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels with fields by analyzing structural properties of the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two steps allows OPAL to deal effectively with many forms beyond the grasp of existing form-filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.
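
A toy version of the labeling step is sketched below: each input is associated with a label via the HTML for/id link, falling back to the nearest preceding label in document order. OPAL additionally exploits the visual rendering; this sketch uses only BeautifulSoup and invented markup.

    # Structural label-field association with a positional fallback.
    from bs4 import BeautifulSoup

    html = """
    <form>
      <label for="minp">Min price</label><input id="minp" name="minprice">
      <label>Bedrooms</label><input name="beds">
    </form>
    """
    soup = BeautifulSoup(html, "html.parser")
    by_for = {l["for"]: l for l in soup.find_all("label") if l.has_attr("for")}

    for inp in soup.find_all("input"):
        label = by_for.get(inp.get("id")) or inp.find_previous("label")
        print(inp["name"], "->", label.get_text(strip=True))
    # minprice -> Min price
    # beds -> Bedrooms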

11 citations


Posted Content
TL;DR: The extraction of multi-attribute objects from the deep web bridges the unstructured web and structured data; AMBER overcomes the limitations of existing approaches through mutual supervision between the repeated structure and automatically produced annotations.
Abstract: The extraction of multi-attribute objects from the deep web is the bridge between the unstructured web and structured data. Existing approaches either induce wrappers from a set of human-annotated pages or leverage repeated structures on the page without supervision. What the former lack in automation, the latter lack in accuracy. Thus accurate, automatic multi-attribute object extraction has remained an open challenge. AMBER overcomes both limitations through mutual supervision between the repeated structure and automatically produced annotations. Previous approaches based on automatic annotations have suffered from low quality due to the inherent noise in the annotations and have attempted to compensate by exploring multiple candidate wrappers. In contrast, AMBER compensates for this noise by integrating repeated structure analysis with annotation-based induction: the repeated structure limits the search space for wrapper induction, and conversely, the annotations allow the repeated structure analysis to distinguish noise from relevant data. Both low recall and low precision in the annotations are thus mitigated, achieving almost human-quality multi-attribute object extraction (more than 98 percent accuracy). To reach this accuracy, AMBER needs to be trained only once for an entire domain, bootstrapping its training from a small, possibly noisy set of attribute instances and a few unannotated sites of the domain.
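
A greatly simplified illustration of the mutual supervision idea: repeated sibling structure proposes candidate record groups, and a noisy annotator votes on which group is the actual result list. All data and the annotator below are synthetic.

    from collections import Counter

    # Each node: (parent_id, tag_signature, text)
    nodes = [(1, "div.result", "Flat, 2 bed, £950 pcm"),
             (1, "div.result", "House, 3 bed, £1400 pcm"),
             (1, "div.ad",     "Buy gold now"),
             (2, "li.nav",     "About us"),
             (2, "li.nav",     "Contact")]

    def looks_annotated(text):
        """Noisy annotator: fires on price-like strings."""
        return "£" in text

    groups = Counter((p, sig) for p, sig, _ in nodes)         # repeated structure
    repeated = {k for k, n in groups.items() if n >= 2}
    best = max(repeated, key=lambda k: sum(looks_annotated(t)
               for p, s, t in nodes if (p, s) == k))          # annotation votes
    print(best)  # (1, 'div.result') beats the navigation list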

9 citations


Proceedings ArticleDOI
16 Apr 2012
TL;DR: This work shows how AMBER uses the repeated structure of records on deep web result pages to learn domain gazetteers from a small seed set, a process that is only possible with a highly accurate extraction system.
Abstract: Wrapper induction faces a dilemma: to reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records, rather than relying on the noisy structure of the DOM. With this approach, AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in four iterations: from a small seed sample, we achieve 94.4% accuracy in recognizing UK locations by the fourth iteration.
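
The bootstrapping loop can be pictured as follows; the page data and the matching rule are hypothetical stand-ins for AMBER's actual extraction.

    # Pages whose records match the current gazetteer contribute their
    # location fields back to it, growing the gazetteer per iteration.
    pages = [
        {"fields": {"Oxford", "2 bed"},     "locations": {"Oxford", "Headington"}},
        {"fields": {"Headington", "3 bed"}, "locations": {"Headington", "Banbury"}},
    ]
    gazetteer = {"Oxford"}                   # small seed sample
    for _ in range(4):                       # the demo reports four iterations
        for page in pages:
            if gazetteer & page["fields"]:   # page recognised via current gazetteer
                gazetteer |= page["locations"]
    print(sorted(gazetteer))                 # ['Banbury', 'Headington', 'Oxford']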

8 citations


Proceedings Article
01 Jan 2012
TL;DR: FShell is an automated white-box test-input generator for C programs, computing test data with respect to user-specified code coverage criteria; to solve the reachability problem posed in SV-COMP, it specifies coverage of ERROR labels.
Abstract: FShell is an automated white-box test-input generator for C programs, computing test data with respect to user-specified code coverage criteria. The pillars of FShell are the declarative specification language FQL (FShell Query Language), an efficient back end for computing test data, and a mathematical framework to reason about coverage criteria. To solve the reachability problem posed in SV-COMP, we specify coverage of ERROR labels. As back end, FShell uses bounded model checking, building upon components of CBMC and leveraging the power of SAT solvers for efficient enumeration of a full test suite.
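
Conceptually, the reduction treats "is the ERROR label reachable?" as "produce a test covering the ERROR label". The toy below illustrates that reading by bounded exploration of a small control-flow graph; it is not FShell's implementation, which delegates the search to CBMC components and a SAT solver.

    # Bounded search for paths reaching an ERROR node in a tiny CFG.
    CFG = {"entry": ["a", "b"], "a": ["exit"], "b": ["ERROR", "exit"],
           "ERROR": [], "exit": []}

    def bounded_paths(node, bound, path=()):
        path += (node,)
        if node == "ERROR":
            yield path                       # a witness, akin to a test input
        if bound:
            for succ in CFG[node]:
                yield from bounded_paths(succ, bound - 1, path)

    print(next(bounded_paths("entry", bound=3)))  # ('entry', 'b', 'ERROR')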

7 citations


Book ChapterDOI
24 Mar 2012
TL;DR: FShell is an automated white-box test-input generator for C programs that computes test data with respect to user-specified code coverage criteria and uses bounded model checking to solve the reachability problem posed in SV-COMP.
Abstract: FShell is an automated white-box test-input generator for C programs, computing test data with respect to user-specified code coverage criteria. The pillars of FShell are the declarative specification language FQL (FShell Query Language), an efficient back end for computing test data, and a mathematical framework to reason about coverage criteria. To solve the reachability problem posed in SV-COMP we specify coverage of ERROR labels. As back end, FShell uses bounded model checking, building upon components of CBMC and leveraging the power of SAT solvers for efficient enumeration of a full test suite.

Proceedings Article
08 Jul 2012
TL;DR: This work proposes a novel method for extending minimal seed lists into complete gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations; the resulting gazetteers easily outperform previous approaches on named entity recognition.
Abstract: Key to named entity recognition, the manual gazetteering of entity lists is a costly, error-prone process that often yields results that are incomplete and suffer from sampling bias. Exploiting current sources of structured information, we propose a novel method for extending minimal seed lists into complete gazetteers. Like previous approaches, we value Wikipedia as a huge, well-curated, and relatively unbiased source of entities. However, in contrast to previous work, we exploit not only its content, but also its structure, as exposed in DBpedia. We extend gazetteers through Wikipedia categories, carefully limiting the impact of noisy categorizations. The resulting gazetteers easily outperform previous approaches on named entity recognition.
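
The category-expansion step might look like the sketch below, which pulls the members of one Wikipedia category from DBpedia's public SPARQL endpoint via SPARQLWrapper; the chosen category is illustrative, and the paper's machinery for limiting noisy categories is not reproduced here.

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dct: <http://purl.org/dc/terms/>
        SELECT ?member WHERE {
          ?member dct:subject <http://dbpedia.org/resource/Category:Cities_in_England> .
        } LIMIT 100
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    gazetteer = {b["member"]["value"] for b in results["results"]["bindings"]}
    print(len(gazetteer), "entities added to the gazetteer")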

Book ChapterDOI
03 Sep 2012
TL;DR: What if you could turn all websites of an entire domain into a single database, with all real estate offers, airline flights, or restaurant menus automatically collected from hundreds or thousands of sources and presented as a single homogeneous dataset?
Abstract: What if you could turn all websites of an entire domain into a single database? Imagine all real estate offers, all airline flights, or all your local restaurants’ menus automatically collected from hundreds or thousands of agencies, travel agencies, or restaurants, presented as a single homogeneous dataset.

01 Jan 2012
TL;DR: Static and adaptive optimisation techniques, as used for query languages, significantly improve wrapper execution performance and can easily cut wrapper evaluation time by one order of magnitude.
Abstract: Web wrappers access databases hidden in the deep web by first interacting with web sites, e.g., filling forms or clicking buttons, and then extracting the relevant data from the thus unearthed result pages. Though the (semi-)automatic induction and maintenance of such wrappers has been studied extensively, the efficient execution and optimisation of wrappers has seen far less attention. We demonstrate that static and adaptive optimisation techniques, as used for query languages, significantly improve wrapper execution performance. At the same time, we highlight the differences between wrapper optimisation and common query optimisation for databases: (1) the runtime of wrappers is entirely dominated by page loads, while other operations (such as querying DOMs) have almost no impact, requiring a new cost model to guide the optimisation; (2) while adaptive query planning is otherwise often considered inessential, wrappers need to be optimised at runtime, since crucial information on the structure of the visited pages only becomes accessible at runtime. We introduce two basic but highly effective optimisation techniques, one static and one adaptive, and show that they can easily cut wrapper evaluation time by one order of magnitude. We demonstrate our approach with wrappers specified in OXPath.
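
The simplest optimisation consistent with this cost model can be sketched as follows: since page loads dominate, never load the same page twice. The cache below is an illustrative stand-in, not one of the paper's two techniques; fetch() merely simulates the expensive page load.

    import time

    _cache = {}

    def fetch(url):
        """Placeholder page load -- the dominant cost in the model."""
        if url not in _cache:
            time.sleep(0.1)                  # stands in for network + rendering
            _cache[url] = f"<html>{url}</html>"
        return _cache[url]

    def run_wrapper(result_urls):
        # DOM queries are cheap; distinct page loads are what counts.
        return [fetch(u) for u in result_urls]

    pages = run_wrapper(["http://example.com/p1", "http://example.com/p1"])
    print(len(_cache), "page load(s) for", len(pages), "accesses")  # 1 for 2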