Managing information extraction: state of the art and research directions

doi:10.1145/1142473.1142595

Proceedings Article

- pp 799-800

TLDR

This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data (focusing in particular on text), and shows how interested researchers can take the next step, by pointing to open problems, available datasets, applicable standards, and software tools.

Abstract:

This tutorial makes the case for developing a unified framework that manages information extraction from unstructured data (focusing in particular on text). We first survey research on information extraction in the database, AI, NLP, IR, and Web communities in recent years. Then we discuss why this is the right time for the database community to actively participate and address the problem of managing information extraction (including in particular the challenges of maintaining and querying the extracted information, and accounting for the imprecision and uncertainty inherent in the extraction process). Finally, we show how interested researchers can take the next step, by pointing to open problems, available datasets, applicable standards, and software tools. We do not assume prior knowledge of text management, NLP, extraction techniques, or machine learning.

Citations

Journal Article

YAGO: A Large Ontology from Wikipedia and WordNet

Fabian M. Suchanek, +2 more

01 Sep 2008, Journal of Web Semantics

TL;DR: YAGO is a large ontology with high coverage and precision, based on a clean logical model with a decidable consistency that allows representing n-ary relations in a natural way while maintaining compatibility with RDFS.

Proceedings Article

NAGA: Searching and Ranking Knowledge

Gjergji Kasneci, +4 more

TL;DR: This paper proposes NAGA, a new semantic search engine that builds on a knowledge base, which is organized as a graph with typed edges, and consists of millions of entities and relationships extracted from Web-based corpora.

Proceedings Article

Declarative information extraction using datalog with embedded extraction predicates

Warren Shen, +3 more

TL;DR: This paper argues that developing information extraction programs using Datalog with embedded procedural extraction predicates is a good way to proceed, and shows how optimizing such programs raises challenges specific to text data that cannot be accommodated in the current relational optimization framework.

Proceedings Article

EntityRank: searching entities directly and holistically

Tao Cheng, +2 more

TL;DR: This work focuses on the core challenge of ranking entities by distilling the underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking.

Journal Article

On the provenance of non-answers to queries over extracted data

Jiansheng Huang, +3 more

TL;DR: This work develops a mechanism for providing provenance-style explanations for non-answers, and suggests that this approach can provide effective provenance information that helps users resolve their doubts over non-answers to a query.

Related Papers (5)

Declarative information extraction using datalog with embedded extraction predicates

An Algebraic Approach to Rule-Based Information Extraction

Information Extraction

A framework and graphical development environment for robust NLP tools and applications.

UIMA: an architectural approach to unstructured information processing in the corporate research environment