
Showing papers on "Semantic Web published in 2008"


Journal ArticleDOI
TL;DR: In a very significant development for eHealth, a broad adoption of Web 2.0 technologies and approaches coincides with the more recent emergence of Personal Health Application Platforms and Personally Controlled Health Records such as Google Health, Microsoft HealthVault, and Dossia.
Abstract: In a very significant development for eHealth, broad adoption of Web 2.0 technologies and approaches coincides with the more recent emergence of Personal Health Application Platforms and Personally Controlled Health Records such as Google Health, Microsoft HealthVault, and Dossia. "Medicine 2.0" applications, services and tools are defined as Web-based services for health care consumers, caregivers, patients, health professionals, and biomedical researchers, that use Web 2.0 technologies and/or semantic web and virtual reality approaches to enable and facilitate specifically 1) social networking, 2) participation, 3) apomediation, 4) openness and 5) collaboration, within and between these user groups. The Journal of Medical Internet Research (JMIR) publishes a Medicine 2.0 theme issue and sponsors a conference on "How Social Networking and Web 2.0 changes Health, Health Care, Medicine and Biomedical Research", to stimulate and encourage research in these five areas.

1,038 citations


01 Jan 2008
TL;DR: The OWL 2 Web Ontology Language, informally OWL 2, is an ontology language for the Semantic Web with formally defined meaning.
Abstract: The OWL 2 Web Ontology Language, informally OWL 2, is an ontology language for the Semantic Web with formally defined meaning. OWL 2 ontologies provide classes, properties, individuals, and data values and are stored as Semantic Web documents. OWL 2 ontologies can be used along with information written in RDF, and OWL 2 ontologies themselves are primarily exchanged as RDF documents. The OWL 2 Document Overview describes the overall state of OWL 2, and should be read before other OWL 2 documents. The meaningful constructs provided by OWL 2 are defined in terms of their structure. As well, a functional-style syntax is defined for these constructs, with examples and informal descriptions. One can reason with OWL 2 ontologies under either the RDF-Based Semantics [OWL 2 RDF-Based Semantics] or the Direct Semantics [OWL 2 Direct Semantics]. If certain restrictions on OWL 2 ontologies are satisfied and the ontology is in OWL 2 DL, reasoning under the Direct Semantics can be implemented using techniques well known in the literature.

957 citations
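As a concrete illustration of the constructs the overview describes (classes, properties, individuals, and data values, exchanged as RDF), here is a minimal sketch that builds a tiny OWL ontology as an RDF graph with the rdflib Python library. The namespace and all names are invented for the example; this is not code from the specification.

```python
# Minimal sketch: a tiny OWL ontology as an RDF graph (rdflib).
# The ex: namespace and all its terms are hypothetical.
from rdflib import Graph, Namespace, Literal, RDF, RDFS
from rdflib.namespace import OWL, XSD

EX = Namespace("http://example.org/onto#")
g = Graph()
g.bind("ex", EX)
g.bind("owl", OWL)

g.add((EX.Person, RDF.type, OWL.Class))             # a class
g.add((EX.hasAge, RDF.type, OWL.DatatypeProperty))  # a data property
g.add((EX.hasAge, RDFS.domain, EX.Person))
g.add((EX.hasAge, RDFS.range, XSD.integer))
g.add((EX.alice, RDF.type, EX.Person))              # an individual
g.add((EX.alice, EX.hasAge, Literal(30)))           # a data value

# OWL 2 ontologies are primarily exchanged as RDF documents:
print(g.serialize(format="turtle"))
```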


Journal ArticleDOI
TL;DR: This paper proposes a class of applications called collective knowledge systems, which unlock the "collective intelligence" of the Social Web with knowledge representation and reasoning techniques of the Semantic Web.

802 citations


Journal ArticleDOI
TL;DR: The present article details this new approach to build mashups of bioinformatics data and illustrates the building of a mashup used to explore the implication of four transcription factor genes in Parkinson's disease.

800 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This paper proposes an RDF storage scheme that uses the triple nature of RDF as an asset, which confers significant advantages compared to previous approaches for RDF data management, at the price of a worst-case five-fold increase in index space.
Abstract: Despite the intense interest towards realizing the Semantic Web vision, most existing RDF data management schemes are constrained in terms of efficiency and scalability. Still, the growing popularity of the RDF format arguably calls for an effort to offset these drawbacks. Viewed from a relational-database perspective, these constraints are derived from the very nature of the RDF data model, which is based on a triple format. Recent research has attempted to address these constraints using a vertical-partitioning approach, in which separate two-column tables are constructed for each property. However, as we show, this approach suffers from similar scalability drawbacks on queries that are not bound by RDF property value. In this paper, we propose an RDF storage scheme that uses the triple nature of RDF as an asset. This scheme enhances the vertical partitioning idea and takes it to its logical conclusion. RDF data is indexed in six possible ways, one for each possible ordering of the three RDF elements. Each instance of an RDF element is associated with two vectors; each such vector gathers elements of one of the other types, along with lists of the third-type resources attached to each vector element. Hence, a sextuple-indexing scheme emerges. This format allows for quick and scalable general-purpose query processing; it confers significant advantages (up to five orders of magnitude) compared to previous approaches for RDF data management, at the price of a worst-case five-fold increase in index space. We experimentally document the advantages of our approach on real-world and synthetic data sets with practical queries.

684 citations
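The sextuple-indexing idea can be caricatured in a few lines of Python: one nested index per ordering of (subject, predicate, object), so that any triple pattern reduces to a lookup. This is a toy illustration with plain dictionaries, not the paper's storage engine.

```python
# Toy sketch of sextuple indexing: one nested index per ordering of
# (subject, predicate, object). Illustration only, not the paper's engine.
from collections import defaultdict

ORDERS = ["spo", "sop", "pso", "pos", "osp", "ops"]

class Hexastore:
    def __init__(self):
        # order -> first element -> second element -> set of third elements
        self.idx = {o: defaultdict(lambda: defaultdict(set)) for o in ORDERS}

    def add(self, s, p, o):
        t = {"s": s, "p": p, "o": o}
        for order in ORDERS:
            a, b, c = (t[k] for k in order)
            self.idx[order][a][b].add(c)

    def objects(self, s, p):
        """All objects of a (subject, predicate) pair, via the spo index."""
        return self.idx["spo"][s][p]

store = Hexastore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
print(store.objects("alice", "knows"))  # {'bob', 'carol'}
```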


Journal ArticleDOI
TL;DR: In this paper, the semantic sensor web (SSW) proposes that sensor data be annotated with semantic metadata that will both increase interoperability and provide contextual information essential for situational knowledge.
Abstract: Sensors are distributed across the globe, leading to an avalanche of data about our environment. It is possible today to utilize networks of sensors to detect and identify a multitude of observations, from simple phenomena to complex events and situations. The lack of integration and communication between these networks, however, often isolates important data streams and intensifies the existing problem of too much data and not enough knowledge. With a view to addressing this problem, the semantic sensor Web (SSW) proposes that sensor data be annotated with semantic metadata that will both increase interoperability and provide contextual information essential for situational knowledge.

658 citations
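To make the annotation idea concrete, here is a hedged sketch of attaching semantic metadata (type, observed value, spatial and temporal context) to one sensor reading as RDF, using the rdflib Python library. The namespace and property names are invented for illustration; they are not the SSW vocabulary.

```python
# Hypothetical sketch: one sensor observation annotated with semantic
# metadata in RDF. All ex: terms are invented for illustration.
from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

EX = Namespace("http://example.org/sensors#")
g = Graph()
g.bind("ex", EX)

obs = EX.observation42
g.add((obs, RDF.type, EX.TemperatureObservation))
g.add((obs, EX.observedBy, EX.sensor7))
g.add((obs, EX.hasValue, Literal(21.5, datatype=XSD.double)))
# Contextual metadata that supports situational knowledge:
g.add((obs, EX.hasLocation, EX.room101))
g.add((obs, EX.observedAt,
       Literal("2008-06-01T12:00:00", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```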


Journal ArticleDOI
TL;DR: How ontologies provide the semantics, as explained here with the help of Harry Potter and his owl Hedwig.
Abstract: How ontologies provide the semantics, as explained here with the help of Harry Potter and his owl Hedwig.

629 citations


Journal ArticleDOI
TL;DR: This paper gives an overview of approaches in this context to managing probabilistic uncertainty, possibilistic Uncertainty, and vagueness in expressive description logics for the Semantic Web.

522 citations


Book ChapterDOI
29 Sep 2008
TL;DR: This paper analyzes the complexity of product description on the Semantic Web and defines the GoodRelations ontology that covers the representational needs of typical business scenarios for commodity products and services.
Abstract: A promising application domain for Semantic Web technology is the annotation of products and services offerings on the Web so that consumers and enterprises can search for suitable suppliers using products and services ontologies. While there has been substantial progress in developing ontologies for types of products and services, namely eClassOWL, this alone does not provide the representational means required for e-commerce on the Semantic Web. Particularly missing is an ontology that allows describing the relationships between (1) Web resources, (2) offerings made by means of those Web resources, (3) legal entities, (4) prices, (5) terms and conditions, and the aforementioned ontologies for products and services (6). For example, we must be able to say that a particular Web site describes an offer to sell cell phones of a certain make and model at a certain price, that a piano house offers maintenance for pianos that weigh less than 150 kg, or that a car rental company leases out cars of a certain make and model from a set of branches across the country. In this paper, we analyze the complexity of product description on the Semantic Web and define the GoodRelations ontology that covers the representational needs of typical business scenarios for commodity products and services.

403 citations
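As a rough illustration of the relationships the paper enumerates (a legal entity, an offering it makes, the included product, and a price specification), here is a hedged rdflib sketch. The gr: terms follow the published GoodRelations vocabulary, but the modeling below is simplified and the shop data is invented.

```python
# Simplified sketch of a GoodRelations-style offering; shop data invented.
from rdflib import Graph, Namespace, Literal, RDF
from rdflib.namespace import XSD

GR = Namespace("http://purl.org/goodrelations/v1#")
EX = Namespace("http://example.org/shop#")  # hypothetical shop
g = Graph()
g.bind("gr", GR)
g.bind("ex", EX)

g.add((EX.acmePhones, RDF.type, GR.BusinessEntity))          # legal entity
g.add((EX.offer1, RDF.type, GR.Offering))                    # the offering
g.add((EX.acmePhones, GR.offers, EX.offer1))
g.add((EX.offer1, GR.includes, EX.phoneModelX))              # the product
g.add((EX.priceSpec1, RDF.type, GR.UnitPriceSpecification))  # the price
g.add((EX.offer1, GR.hasPriceSpecification, EX.priceSpec1))
g.add((EX.priceSpec1, GR.hasCurrency, Literal("EUR")))
g.add((EX.priceSpec1, GR.hasCurrencyValue,
       Literal("199.00", datatype=XSD.float)))

print(g.serialize(format="turtle"))
```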


01 Jan 2008
TL;DR: This tutorial will provide participants with a solid foundation from which to begin publishing Linked Data on the Web, as well as to implement applications that consume Linked Data from the Web.
Abstract: The Web is increasingly understood as a global information space consisting not just of linked documents, but also of Linked Data. The Linked Data principles provide a basis for realizing this Web of Data, or Semantic Web. Since early 2007 numerous data sets have been published on the Web according to these principles, in domains as broad as music, books, geographical information, films, people, events, reviews and photos. In combination these data sets consist of over 2 billion RDF triples, interlinked by more than 3 million triples that cross data sets. As this Web of Linked Data continues to grow, and an increasing number of applications are developed that exploit these data sets, there is a growing need for data publishers, researchers, developers and Web practitioners to understand Linked Data principles and practice. Run by some of the leading members of the Linked Data community, this tutorial will address those needs, and provide participants with a solid foundation from which to begin publishing Linked Data on the Web, as well as to implement applications that consume Linked Data from the Web.

377 citations
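The consumption side the tutorial targets can be sketched in a few lines: dereference a resource URI and load the returned RDF into a graph. This assumes network access and that the server (DBpedia here) still serves RDF under content negotiation.

```python
# Hedged sketch of consuming Linked Data with rdflib: dereference a
# resource URI and parse whatever RDF serialization the server returns.
from rdflib import Graph

g = Graph()
g.parse("http://dbpedia.org/resource/Semantic_Web")  # follows redirects

# Print a handful of the retrieved triples.
for s, p, o in list(g)[:10]:
    print(s, p, o)
```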


Proceedings ArticleDOI
21 Apr 2008
TL;DR: This workshop summary will outline the technical context in which Linked Data is situated, describe developments in the past year through initiatives such as the Linking Open Data community project, and look ahead to the workshop itself.
Abstract: The Web is increasingly understood as a global information space consisting not just of linked documents, but also of Linked Data. More than just a vision, the resulting Web of Data has been brought into being by the maturing of the Semantic Web technology stack, and by the publication of an increasing number of datasets according to the principles of Linked Data. The Linked Data on the Web (LDOW2008) workshop brings together researchers and practitioners working on all aspects of Linked Data. The workshop provides a forum to present the state of the art in the field and to discuss ongoing and future research challenges. In this workshop summary we will outline the technical context in which Linked Data is situated, describe developments in the past year through initiatives such as the Linking Open Data community project, and look ahead to the workshop itself.

Journal ArticleDOI
TL;DR: Sindice, a lookup index over Semantic Web resources, allows applications to automatically locate documents containing information about a given resource, and extends the sitemap protocol to efficiently index large datasets with minimal impact on data providers.
Abstract: Data discovery on the Semantic Web requires crawling and indexing of statements, in addition to the 'linked-data' approach of de-referencing resource URIs. Existing Semantic Web search engines are focused on database-like functionality, compromising on index size, query performance and live updates. We present Sindice, a lookup index over Semantic Web resources. Our index allows applications to automatically locate documents containing information about a given resource. In addition, we allow resource retrieval through inverse-functional properties, offer a full-text search and index SPARQL endpoints. Finally, we extend the sitemap protocol to efficiently index large datasets with minimal impact on data providers.
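The core lookup service can be pictured as an inverted index from resource URIs to the documents that mention them. Below is a toy Python sketch of that idea only; it is not Sindice's implementation and omits inverse-functional-property retrieval, full-text search, and SPARQL endpoint indexing.

```python
# Toy sketch of a resource-to-document lookup index (not Sindice itself).
from collections import defaultdict

class ResourceIndex:
    def __init__(self):
        self.by_resource = defaultdict(set)

    def index_document(self, doc_url, triples):
        """Register every URI occurring in a document's triples."""
        for s, p, o in triples:
            for term in (s, p, o):
                if isinstance(term, str) and term.startswith("http"):
                    self.by_resource[term].add(doc_url)

    def lookup(self, resource_uri):
        """Documents containing information about the given resource."""
        return self.by_resource[resource_uri]

idx = ResourceIndex()
idx.index_document("http://example.org/doc1",
                   [("http://example.org/alice",
                     "http://xmlns.com/foaf/0.1/knows",
                     "http://example.org/bob")])
print(idx.lookup("http://example.org/alice"))  # {'http://example.org/doc1'}
```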

BookDOI
01 Jan 2008
Abstract: Research Track.- Involving Domain Experts in Authoring OWL Ontologies.- Supporting Collaborative Ontology Development in Protege.- Identifying Potentially Important Concepts and Relations in an Ontology.- RoundTrip Ontology Authoring.- nSPARQL: A Navigational Language for RDF.- An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario.- Anytime Query Answering in RDF through Evolutionary Algorithms.- The Expressive Power of SPARQL.- Integrating Object-Oriented and Ontological Representations: A Case Study in Java and OWL.- Extracting Semantic Constraint from Description Text for Semantic Web Service Discovery.- Enhancing Semantic Web Services with Inheritance.- Using Semantic Distances for Reasoning with Inconsistent Ontologies.- Statistical Learning for Inductive Query Answering on OWL Ontologies.- Optimization and Evaluation of Reasoning in Probabilistic Description Logic: Towards a Systematic Approach.- Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning.- Comparison between Ontology Distances (Preliminary Results).- Folksonomy-Based Collabulary Learning.- Combining a DL Reasoner and a Rule Engine for Improving Entailment-Based OWL Reasoning.- Improving an RCC-Derived Geospatial Approximation by OWL Axioms.- OWL Datatypes: Design and Implementation.- Laconic and Precise Justifications in OWL.- Learning Concept Mappings from Instance Similarity.- Instanced-Based Mapping between Thesauri and Folksonomies.- Collecting Community-Based Mappings in an Ontology Repository.- Algebras of Ontology Alignment Relations.- Scalable Grounded Conjunctive Query Evaluation over Large and Expressive Knowledge Bases.- A Kernel Revision Operator for Terminologies - Algorithms and Evaluation.- Description Logic Reasoning with Decision Diagrams.- RDF123: From Spreadsheets to RDF.- Evaluating Long-Term Use of the Gnowsis Semantic Desktop for PIM.- Bringing the IPTC News Architecture into the Semantic Web.- RDFS Reasoning and Query Answering on Top of DHTs.- An Interface-Based Ontology Modularization Framework for Knowledge Encapsulation.- On the Semantics of Trust and Caching in the Semantic Web.- Semantic Web Service Choreography: Contracting and Enactment.- Formal Model for Semantic-Driven Service Execution.- Efficient Semantic Web Service Discovery in Centralized and P2P Environments.- Exploring Semantic Social Networks Using Virtual Reality.- Semantic Grounding of Tag Relatedness in Social Bookmarking Systems.- Semantic Modelling of User Interests Based on Cross-Folksonomy Analysis.- ELP: Tractable Rules for OWL 2.- Term Dependence on the Semantic Web.- Semantic Relatedness Measure Using Object Properties in an Ontology.- Semantic Web in Use Track.- Thesaurus-Based Search in Large Heterogeneous Collections.- Deploying Semantic Web Technologies for Work Integrated Learning in Industry - A Comparison: SME vs. Large Sized Company.- Creating and Using Organisational Semantic Webs in Large Networked Organisations.- An Architecture for Semantic Navigation and Reasoning with Patient Data - Experiences of the Health-e-Child Project.- Requirements Analysis Tool: A Tool for Automatically Analyzing Software Requirements Documents.- OntoNaviERP: Ontology-Supported Navigation in ERP Software Documentation.- Market Blended Insight: Modeling Propensity to Buy with the Semantic Web.- DogOnt - Ontology Modeling for Intelligent Domotic Environments.- Introducing IYOUIT.- A Semantic Data Grid for Satellite Mission Quality Analysis.- A Process Catalog for Workflow Generation.- Inference Web in Action: Lightweight Use of the Proof Markup Language.- Supporting Ontology-Based Dynamic Property and Classification in WebSphere Metadata Server.- Towards a Multimedia Content Marketplace Implementation Based on Triplespaces.- Doctoral Consortium Track.- Semantic Enrichment of Folksonomy Tagspaces.- Contracting and Copyright Issues for Composite Semantic Services.- Parallel Computation Techniques for Ontology Reasoning.- Towards Semantic Mapping for Casual Web Users.- Interactive Exploration of Heterogeneous Cultural Heritage Collections.- End-User Assisted Ontology Evolution in Uncertain Domains.- Learning Methods in Multi-grained Query Answering.

Journal ArticleDOI
TL;DR: The purpose of this paper is to identify the exact relationships between these research areas and to determine the boundaries of each field, by performing a broad review of the relevant literature.
Abstract: Ontologies play a key role in the advent of the Semantic Web. An important problem when dealing with ontologies is the modification of an existing ontology in response to a certain need for change. This problem is a complex and multifaceted one, because it can take several different forms and includes several related subproblems, like heterogeneity resolution or keeping track of ontology versions. As a result, it is being addressed by several different, but closely related and often overlapping research disciplines. Unfortunately, the boundaries of each such discipline are not clear, as the same term is often used with different meanings in the relevant literature, creating a certain amount of confusion. The purpose of this paper is to identify the exact relationships between these research areas and to determine the boundaries of each field, by performing a broad review of the relevant literature.

01 Jan 2008
TL;DR: In this special issue, the focus will be on the technical side, although other issues related to knowledge and data engineering for e-learning may also be considered.
Abstract: With the advent of the Internet, we are seeing more sophisticated techniques being developed to support e-learning. The rapid development of Web-based learning and new concepts like virtual classrooms, virtual laboratories and virtual universities introduces many new issues to be addressed. On the technical side, we need to develop effective e-technologies for supporting distance education. On the learning and management side, we need to consider issues such as new styles of learning and different system set-up requirements. Finally, the issue of standardization of e-learning systems should also be considered. In this special issue, our focus will be on the technical side, although other issues related to knowledge and data engineering for e-learning may also be considered. Topics: In this special issue, we call for original papers describing novel knowledge and data engineering techniques that support e-learning. Preference will be given to papers that include an evaluation of users' experience in using the proposed methods. Areas of interest include, but are not limited to:
• Semantic Web technology for e-learning
• Data modeling (e.g., XML) for efficient management of course materials
• Searching and indexing techniques to support effective course notes retrieval
• User-centric e-learning systems and user interaction management
• Profiling techniques to support grading and learning recommendation
• Data and knowledge base support for pervasive e-learning
• Course material analysis and understanding
• Automatic generation of questions and answers
• Collaborative communities for e-learning

Journal ArticleDOI
TL;DR: This paper attempts to put past efforts on data integration in perspective, highlight the reasons for success and failure, and indicate some pointers to the future.

Journal ArticleDOI
TL;DR: This paper presents sound and complete algorithms for the main reasoning problems in the new probabilistic description logics, which are based on reductions to reasoning in their classical counterparts, and to solving linear optimization problems.

Book ChapterDOI
26 Oct 2008
TL;DR: A new house modeling ontology designed to fit real world domotic system capabilities and to support interoperation between currently available and future solutions is proposed.
Abstract: Home automation has recently gained a new momentum thanks to the ever-increasing commercial availability of domotic components. In this context, researchers are working to provide interoperation mechanisms and to add intelligence on top of them. For supporting intelligent behaviors, house modeling is an essential requirement to understand current and future house states and to possibly drive more complex actions. In this paper we propose a new house modeling ontology designed to fit real world domotic system capabilities and to support interoperation between currently available and future solutions. Taking advantage of technologies developed in the context of the Semantic Web, the DogOnt ontology supports device/network independent description of houses, including both "controllable" and architectural elements. States and functionalities are automatically associated to the modeled elements through proper inheritance mechanisms and by means of properly defined SWRL auto-completion rules which ease the modeling process, while automatic device recognition is achieved through classification reasoning.
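A hedged sketch of the inheritance idea: attach a state to a device superclass once, and recover it for instances by walking up the class hierarchy. The ex: terms below are invented, not actual DogOnt vocabulary, and the traversal assumes an acyclic hierarchy.

```python
# Hypothetical sketch of inheritance-based device modeling with rdflib.
# ex: terms are invented; the class hierarchy is assumed acyclic.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/home#")
g = Graph()
g.bind("ex", EX)

g.add((EX.Lamp, RDFS.subClassOf, EX.Controllable))
g.add((EX.Controllable, EX.hasState, EX.OnOffState))  # stated once, high up
g.add((EX.kitchenLamp, RDF.type, EX.Lamp))

def inherited_states(graph, instance):
    states = set()
    for cls in graph.objects(instance, RDF.type):
        todo = [cls]
        while todo:  # walk up the subclass chain
            c = todo.pop()
            states.update(graph.objects(c, EX.hasState))
            todo.extend(graph.objects(c, RDFS.subClassOf))
    return states

print(inherited_states(g, EX.kitchenLamp))  # {ex:OnOffState}
```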

Book ChapterDOI
26 Oct 2008
TL;DR: Several measures of tag similarity are analyzed and a semantic grounding is provided by mapping pairs of similar tags in the folksonomy to pairs of synsets in Wordnet, where validated measures of semantic distance characterize the semantic relation between the mapped tags.
Abstract: Collaborative tagging systems have nowadays become important data sources for populating semantic web applications. For tasks like synonym detection and discovery of concept hierarchies, many researchers introduced measures of tag similarity. Even though most of these measures appear very natural, their design often seems to be rather ad hoc, and the underlying assumptions on the notion of similarity are not made explicit. A more systematic characterization and validation of tag similarity in terms of formal representations of knowledge is still lacking. Here we address this issue and analyze several measures of tag similarity: Each measure is computed on data from the social bookmarking system del.icio.us and a semantic grounding is provided by mapping pairs of similar tags in the folksonomy to pairs of synsets in Wordnet, where we use validated measures of semantic distance to characterize the semantic relation between the mapped tags. This exposes important features of the investigated similarity measures and indicates which ones are better suited in the context of a given semantic application.
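The two halves of the method can be sketched separately: a folksonomy-side similarity (here, a simple Jaccard co-occurrence measure, one of the family the paper analyzes) and a WordNet-side grounding via a semantic distance over synsets. This assumes nltk with the WordNet corpus downloaded (nltk.download('wordnet')); the tiny post collection is invented.

```python
# Sketch: folksonomy-side co-occurrence similarity vs. WordNet grounding.
from nltk.corpus import wordnet as wn

def cooccurrence_similarity(tag_a, tag_b, posts):
    """Jaccard overlap of the sets of posts in which each tag occurs."""
    a = {i for i, tags in enumerate(posts) if tag_a in tags}
    b = {i for i, tags in enumerate(posts) if tag_b in tags}
    return len(a & b) / len(a | b) if a | b else 0.0

def wordnet_relatedness(tag_a, tag_b):
    """Best path similarity over all synset pairs (one grounding choice)."""
    scores = [s1.path_similarity(s2)
              for s1 in wn.synsets(tag_a)
              for s2 in wn.synsets(tag_b)]
    return max((s for s in scores if s is not None), default=0.0)

posts = [{"python", "programming"}, {"python", "snake"}, {"code", "programming"}]
print(cooccurrence_similarity("python", "programming", posts))  # 0.333...
print(wordnet_relatedness("car", "automobile"))  # 1.0: shared synset
```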

Journal ArticleDOI
TL;DR: The games presented are the first prototypes of the OntoGame series, a collection of scenarios for creating, extending, and updating formal knowledge structures for the semantic Web.
Abstract: Weaving the semantic Web requires that humans contribute their labor and judgment for creating, extending, and updating formal knowledge structures. Hiding such tasks behind online multiplayer games presents the tasks as fun and intellectually challenging entertainment. The games we've presented are the first prototypes of the OntoGame series. We're extending and improving the scenarios in several directions.

Proceedings ArticleDOI
11 Jun 2008
TL;DR: Two different ways to support the NIST Standard RBAC model in OWL are shown, and it is discussed how the OWL constructions can be extended to model attribute-based RBAC or, more generally, attribute-based access control.
Abstract: There have been two parallel themes in access control research in recent years. On the one hand there are efforts to develop new access control models to meet the policy needs of real world application domains. In parallel, and almost separately, researchers have developed policy languages for access control. This paper is motivated by the consideration that these two parallel efforts need to develop synergy. A policy language in the abstract without ties to a model gives the designer little guidance. Conversely a model may not have the machinery to express all the policy details of a given system or may deliberately leave important aspects unspecified. Our vision for the future is a world where advanced access control concepts are embodied in models that are supported by policy languages in a natural intuitive manner, while allowing for details beyond the models to be further specified in the policy language. This paper studies the relationship between the Web Ontology Language (OWL) and the Role Based Access Control (RBAC) model. Although OWL is a web ontology language and not specifically designed for expressing authorization policies, it has been used successfully for this purpose in previous work. OWL is a leading specification language for the Semantic Web, making it a natural vehicle for providing access control in that context. In this paper we show two different ways to support the NIST Standard RBAC model in OWL and then discuss how the OWL constructions can be extended to model attribute-based RBAC or more generally attribute-based access control. We further examine and assess OWL's suitability for two other access control problems: supporting attribute based access control and performing security analysis in a trust-management framework.
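One natural encoding, in the spirit of (but not copied from) the paper's constructions, represents roles as OWL classes: the role hierarchy becomes subclassing and role membership becomes an rdf:type assertion. The sketch below uses an invented ex: namespace and an invented hasPermission property.

```python
# Hedged sketch: RBAC roles as OWL classes, role hierarchy as subclassing.
# All ex: terms (including hasPermission) are invented for illustration.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

EX = Namespace("http://example.org/rbac#")
g = Graph()
g.bind("ex", EX)

# A senior role is a subclass of the junior role it inherits from,
# so under OWL reasoning every Manager is also an Employee.
g.add((EX.Employee, RDF.type, OWL.Class))
g.add((EX.Manager, RDF.type, OWL.Class))
g.add((EX.Manager, RDFS.subClassOf, EX.Employee))

# Permissions attached per role; junior permissions flow to seniors.
g.add((EX.Employee, EX.hasPermission, EX.readReports))
g.add((EX.Manager, EX.hasPermission, EX.approveBudget))

# Role assignment: typing a user activates the role.
g.add((EX.alice, RDF.type, EX.Manager))

print(g.serialize(format="turtle"))
```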

Journal ArticleDOI
TL;DR: Experimental results show that the deployment of EASY on top of an existing SDP, namely Ariadne, enables rich semantic, context- and QoS-aware service discovery, which furthermore performs better than the classical, rigid, syntactic matching, and improves the scalability of Ariadne.

Journal ArticleDOI
TL;DR: N3Logic is a logic that allows rules to be expressed in a Web environment; it extends RDF with syntax for nested graphs and quantified variables, with predicates for implication and for accessing resources on the Web, and with built-in functions such as cryptographic, string, and math functions.
Abstract: The Semantic Web drives toward the use of the Web for interacting with logically interconnected data. Through knowledge models such as Resource Description Framework (RDF), the Semantic Web provides a unifying representation of richly structured data. Adding logic to the Web implies the use of rules to make inferences, choose courses of action, and answer questions. This logic must be powerful enough to describe complex properties of objects but not so powerful that agents can be tricked by being asked to consider a paradox. The Web has several characteristics that can lead to problems when existing logics are used, in particular, the inconsistencies that inevitably arise due to the openness of the Web, where anyone can assert anything. N3Logic is a logic that allows rules to be expressed in a Web environment. It extends RDF with syntax for nested graphs and quantified variables, with predicates for implication and for accessing resources on the Web, and with built-in functions such as cryptographic, string, and math functions. The main goal of N3Logic is to be a minimal extension to the RDF data model such that the same language can be used for logic and data. In this paper, we describe N3Logic and illustrate through examples why it is an appropriate logic for the Web.
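N3Logic rules are evaluated by engines such as cwm; as an engine-free illustration of what a rule does, here is a single N3-style rule, { ?x :parent ?y . ?y :parent ?z } => { ?x :grandparent ?z }, applied by naive forward chaining over plain Python tuples. The rule and facts are invented.

```python
# Engine-free illustration of one N3-style rule,
#   { ?x :parent ?y . ?y :parent ?z } => { ?x :grandparent ?z },
# applied by naive forward chaining over invented facts.
facts = {("alice", "parent", "bob"), ("bob", "parent", "carol")}

def apply_grandparent_rule(facts):
    derived = set()
    for (x, p1, y1) in facts:
        for (y2, p2, z) in facts:
            if p1 == "parent" and p2 == "parent" and y1 == y2:
                derived.add((x, "grandparent", z))
    return derived

facts |= apply_grandparent_rule(facts)
print(("alice", "grandparent", "carol") in facts)  # True
```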

01 Jan 2008
TL;DR: MOAT, a lightweight Semantic Web framework that provides a collaborative way to let Web 2.0 content producers give meanings to their tags in a machine-readable way, relies on Linked Data principles, using URIs from existing resources to define these meanings.
Abstract: This paper introduces MOAT, a lightweight Semantic Web framework that provides a collaborative way to let Web 2.0 content producers give meanings to their tags in a machine-readable way. To achieve this goal, this approach relies on Linked Data principles, using URIs from existing resources to define these meanings. That way, users can create interlinked RDF data and let their content enter the Semantic Web, while solving some limits of free-tagging at the same time.
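The mechanism reduces to a small graph pattern: a tag resource, its free-text label, and a link from the tag to an existing Linked Data URI that fixes its meaning. The sketch below is illustrative; the ex: property names are invented rather than the exact MOAT vocabulary.

```python
# Hypothetical sketch of the MOAT idea: point a free-text tag at a
# Linked Data URI that disambiguates it. ex: terms are invented.
from rdflib import Graph, Namespace, Literal, URIRef, RDF

EX = Namespace("http://example.org/tags#")
g = Graph()
g.bind("ex", EX)

g.add((EX.tag_paris, RDF.type, EX.Tag))
g.add((EX.tag_paris, EX.label, Literal("paris")))
# Meaning chosen by the author: the city, not the person or the film.
g.add((EX.tag_paris, EX.hasMeaning,
       URIRef("http://dbpedia.org/resource/Paris")))

print(g.serialize(format="turtle"))
```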


Journal ArticleDOI
TL;DR: This article discusses putting these together, with linked semantics coupled to linked social networks, to deliver a much greater effect on the power of the Web.

Journal ArticleDOI
TL;DR: This paper presents a complete framework and findings in mining Web usage patterns from the Web log files of a real Web site that has all the challenging aspects of real-life Web usage mining, including evolving user profiles and external data describing an ontology of the Web content.
Abstract: In this paper, we present a complete framework and findings in mining Web usage patterns from Web log files of a real Web site that has all the challenging aspects of real-life Web usage mining, including evolving user profiles and external data describing an ontology of the Web content. Even though the Web site under study is part of a nonprofit organization that does not "sell" any products, it was crucial to understand "who" the users were, "what" they looked at, and "how their interests changed with time," all of which are important questions in Customer Relationship Management (CRM). Hence, we present an approach for discovering and tracking evolving user profiles. We also describe how the discovered user profiles can be enriched with explicit information need that is inferred from search queries extracted from Web log data. Profiles are also enriched with other domain-specific information facets that give a panoramic view of the discovered mass usage modes. An objective validation strategy is also used to assess the quality of the mined profiles, in particular their adaptability in the face of evolving user behavior.

Journal ArticleDOI
TL;DR: This work formulates the Web-service composition problem in terms of AI planning and network optimization problems to investigate its complexity in detail, and develops a novel AI planning-based heuristic Web-service composition algorithm named WSPR.
Abstract: The main research focus of Web services is to achieve the interoperability between distributed and heterogeneous applications. Therefore, flexible composition of Web services to fulfill the given challenging requirements is one of the most important objectives in this research field. However, until now, service composition has been largely an error-prone and tedious process. Furthermore, as the number of available web services increases, finding the right Web services to satisfy the given goal becomes intractable. In this paper, toward these issues, we propose an AI planning-based framework that enables the automatic composition of Web services, and explore the following issues. First, we formulate the Web-service composition problem in terms of AI planning and network optimization problems to investigate its complexity in detail. Second, we analyze publicly available Web service sets using network analysis techniques. Third, we develop a novel Web-service benchmark tool called WSBen. Fourth, we develop a novel AI planning-based heuristic Web-service composition algorithm named WSPR. Finally, we conduct extensive experiments to verify WSPR against state-of-the-art AI planners. It is our hope that both WSPR and WSBen will provide useful insights for researchers to develop Web-service discovery and composition algorithms, and software.
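The planning formulation can be sketched as forward search over sets of known parameters: each service consumes inputs and produces outputs, and a plan is a service sequence that reaches the goal parameters. This is a toy breadth-first version over invented services, not WSPR's heuristic algorithm.

```python
# Toy AI-planning view of composition: breadth-first forward search
# over parameter sets. Services below are invented for illustration.
from collections import deque

# (name, required_inputs, produced_outputs)
services = [
    ("geocode",   {"address"},    {"lat", "lon"}),
    ("weather",   {"lat", "lon"}, {"forecast"}),
    ("translate", {"forecast"},   {"forecast_fr"}),
]

def compose(initial, goal):
    queue = deque([(frozenset(initial), [])])
    seen = {frozenset(initial)}
    while queue:
        known, plan = queue.popleft()
        if goal <= known:
            return plan
        for name, ins, outs in services:
            if ins <= known and not outs <= known:  # applicable and useful
                nxt = frozenset(known | outs)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, plan + [name]))
    return None

print(compose({"address"}, {"forecast_fr"}))
# ['geocode', 'weather', 'translate']
```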

Proceedings Article
01 Jan 2008
TL;DR: This paper gives an in-depth study of the Web's HTML table corpus, and describes a system for performing relation recovery that achieves precision and recall that are comparable to other domain-independent information extraction systems.
Abstract: The World-Wide Web consists of a huge number of unstructured hypertext documents, but it also contains structured data in the form of HTML tables. Many of these tables contain both relational-style data and a small "schema" of labeled and typed columns, making each such table a small structured database. The WebTables project is an effort to extract and make use of the huge number of these structured tables on the Web. A clean collection of relational-style tables could be useful for improving web search, schema design, and many other applications. This paper describes the first stage of the WebTables project. First, we give an in-depth study of the Web's HTML table corpus. For example, we extracted 14.1 billion HTML tables from a several-billion-page portion of Google's general-purpose web crawl, and estimate that 154 million of these tables contain high-quality relational-style data. We also describe the crawl's distribution of table sizes and data types. Second, we describe a system for performing relation recovery. The Web mixes relational and non-relational tables indiscriminately (often on the same page), so there is no simple way to distinguish the 1.1% of good relations from the remainder, nor to recover column label and type information. Our mix of hand-written detectors and statistical classifiers takes a raw Web crawl as input, and generates a collection of databases that is five orders of magnitude larger than any other collection we are aware of. Relation recovery achieves precision and recall that are comparable to other domain-independent information extraction systems.
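The relation-recovery step can be caricatured with one hand-written detector: extract each HTML table and keep those with at least two rows of consistent width, treating row 0 as the schema. The real system combined such detectors with statistical classifiers; this sketch assumes the beautifulsoup4 library and invented markup.

```python
# Toy hand-written detector for relational-style HTML tables: keep
# tables with >= 2 rows of consistent width; row 0 is the schema.
from bs4 import BeautifulSoup

def candidate_relations(html):
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        rows = [[cell.get_text(strip=True)
                 for cell in tr.find_all(["td", "th"])]
                for tr in table.find_all("tr")]
        rows = [r for r in rows if r]
        if len(rows) >= 2 and len({len(r) for r in rows}) == 1:
            yield {"schema": rows[0], "tuples": rows[1:]}

html = """<table><tr><th>City</th><th>Country</th></tr>
<tr><td>Paris</td><td>France</td></tr></table>"""
for rel in candidate_relations(html):
    print(rel["schema"], rel["tuples"])
    # ['City', 'Country'] [['Paris', 'France']]
```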

01 Jan 2008
TL;DR: This document specifies how to use RDFa with XHTML, a specification for attributes to express structured data in any markup language that allows authors and publishers of data to define their own formats without having to update software, register formats via a central authority, or worry that two formats may interfere with each other.
Abstract: The current Web is primarily made up of an enormous number of documents that have been created using HTML. These documents contain significant amounts of structured data, which is largely unavailable to tools and applications. When publishers can express this data more completely, and when tools can read it, a new world of user functionality becomes available, letting users transfer structured data between applications and web sites, and allowing browsing applications to improve the user experience: an event on a web page can be directly imported into a user's desktop calendar; a license on a document can be detected so that users can be informed of their rights automatically; a photo's creator, camera setting information, resolution, location and topic can be published as easily as the original photo itself, enabling structured search and sharing. RDFa is a specification for attributes to express structured data in any markup language. This document specifies how to use RDFa with XHTML. The rendered, hypertext data of XHTML is reused by the RDFa markup, so that publishers don't need to repeat significant data in the document content. The underlying abstract representation is RDF [RDF-PRIMER], which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time. The expressed structure is closely tied to the data, so that rendered data can be copied and pasted along with its relevant structure. The rules for interpreting the data are generic, so that there is no need for different rules for different formats; this allows authors and publishers of data to define their own formats without having to update software, register formats via a central authority, or worry that two formats may interfere with each other. RDFa shares some use cases with microformats [MICROFORMATS]. Whereas microformats specify both a syntax for embedding structured data into HTML documents and a vocabulary of specific terms for each microformat, RDFa specifies only a syntax and relies on independent specification of terms (often called vocabularies or taxonomies) by others. RDFa allows terms from multiple independently-developed vocabularies to be freely intermixed and is designed such that the language can be parsed without knowledge of the specific term vocabulary being used. This document is a detailed syntax specification for RDFa, aimed at: those looking to create an RDFa parser, and who therefore need a detailed description of the parsing rules; those looking to recommend the use of RDFa within their organisation, and who would like to create some guidelines for their users; anyone familiar with RDF, and who wants to understand more about what is happening 'under the hood', when an RDFa parser runs. For those looking for an introduction to the use of RDFa and some real-world examples, please consult the RDFa Primer.