Proceedings Article

Managing information extraction: state of the art and research directions

Abstract
This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data (focusing in particular on text). We first survey research on information extraction in the database, AI, NLP, IR, and Web communities in recent years. Then we discuss why this is the right time for the database community to actively participate and address the problem of managing information extraction (including in particular the challenges of maintaining and querying the extracted information, and accounting for the imprecision and uncertainty inherent in the extraction process). Finally, we show how interested researchers can take the next step, by pointing to open problems, available datasets, applicable standards, and software tools. We do not assume prior knowledge of text management, NLP, extraction techniques, or machine learning.

Citations
Journal Article

YAGO: A Large Ontology from Wikipedia and WordNet

TL;DR: YAGO is a large ontology with high coverage and precision, based on a clean logical model with a decidable consistency that allows representing n-ary relations in a natural way while maintaining compatibility with RDFS.
Proceedings Article

NAGA: Searching and Ranking Knowledge

TL;DR: This paper proposes NAGA, a new semantic search engine that builds on a knowledge base, which is organized as a graph with typed edges, and consists of millions of entities and relationships extracted from Web-based corpora.
Proceedings Article

Declarative information extraction using datalog with embedded extraction predicates

TL;DR: This paper argues that developing information extraction programs using Datalog with embedded procedural extraction predicates is a good way to proceed, and shows how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework.
Proceedings Article

EntityRank: searching entities directly and holistically

TL;DR: This work focuses on the core challenge of ranking entities, by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking.
Journal Article

On the provenance of non-answers to queries over extracted data

TL;DR: This work focuses on providing provenance-style explanations for non-answers. It develops a mechanism for providing this new type of provenance and suggests that the approach can give users effective provenance information to help resolve their doubts over non-answers to a query.