Journal ArticleDOI

An open-source toolkit for mining Wikipedia

01 Jan 2013-Artificial Intelligence (Elsevier)-Vol. 194, pp 222-239
TL;DR: The Wikipedia Miner toolkit is introduced: an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications and that creates databases containing summarized versions of Wikipedia's content and structure.
About: This article is published in Artificial Intelligence. The article was published on 2013-01-01 and is currently open access. It has received 382 citations to date. The article focuses on the topics: Semantic similarity & Document Structure Description.
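The citation excerpts further down this page note that the toolkit was also offered as a public web service at http://wikipedia-miner.cms.waikato.ac.nz/. Purely as a hedged sketch of what calling such a service from an application might look like (the endpoint path and parameter names below are assumptions, not taken from the paper, and the deployment is likely offline):

```python
import urllib.parse
import urllib.request

# Hypothetical sketch only: the host is the one named in a citation excerpt
# below, but the service path ("services/compare") and parameter names
# ("term1", "term2") are assumptions, and the public deployment may no
# longer be reachable.
BASE_URL = "http://wikipedia-miner.cms.waikato.ac.nz/services/compare"

def compare(term1: str, term2: str) -> str | None:
    """Ask the (assumed) relatedness service how related two terms are."""
    query = urllib.parse.urlencode({"term1": term1, "term2": term2})
    try:
        with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=10) as resp:
            return resp.read().decode("utf-8")  # raw response body
    except OSError as err:  # covers URLError, timeouts, DNS failures
        print(f"Service unavailable: {err}")
        return None

print(compare("artificial intelligence", "machine learning"))
```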
Citations
Journal ArticleDOI
TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Abstract: The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.
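As a minimal illustration of the task described above, the sketch below handles name variation with an alias dictionary and entity ambiguity with a prior plus crude context overlap; the tiny knowledge base and scores are invented for the example and do not come from any system covered by the survey.

```python
# Toy knowledge base and deliberately naive scoring, purely for illustration;
# the entities, aliases, priors and descriptions below are invented.
KB = {
    "Michael Jordan (basketball)": {
        "aliases": {"michael jordan", "mj"},
        "prior": 0.7,  # how often the name refers to this entity in training text
        "description": "American basketball player who won six NBA championships",
    },
    "Michael I. Jordan (scientist)": {
        "aliases": {"michael jordan", "michael i. jordan"},
        "prior": 0.3,
        "description": "American researcher in machine learning and statistics",
    },
}

def link(mention: str, context: str) -> str | None:
    """Resolve a mention to a KB entity: alias lookup + prior + context overlap."""
    mention, ctx = mention.lower(), set(context.lower().split())
    best, best_score = None, 0.0
    for entity, info in KB.items():
        if mention not in info["aliases"]:
            continue  # candidate generation: the alias dictionary handles name variation
        overlap = len(ctx & set(info["description"].lower().split()))
        score = info["prior"] + overlap  # naive combination of prior and context evidence
        if score > best_score:
            best, best_score = entity, score
    return best

print(link("Michael Jordan", "a pioneer of machine learning and Bayesian statistics"))
# -> 'Michael I. Jordan (scientist)'
```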

702 citations

01 Jan 2012
TL;DR: This work proposes, in addition, a different strategy that yields knowledge about micro-processes matching actual online behavior, which can then be used to select mathematically tractable models of online network formation and evolution.
Abstract: Research on so-called ‘Big Data’ has received a considerable momentum and is expected to grow in the future. One very interesting stream of research on Big Data analyzes online networks. Many online networks are known to have some typical macro-characteristics, such as ‘small world’ properties. Much less is known about underlying micro-processes leading to these properties. The models used by Big Data researchers usually are inspired by mathematical ease of exposition. We propose to follow in addition a different strategy that leads to knowledge about micro-processes that match with actual online behavior. This knowledge can then be used for the selection of mathematically-tractable models of online network formation and evolution. Insight from social and behavioral research is needed for pursuing this strategy of knowledge generation about micro-processes. Accordingly, our proposal points to a unique role that social scientists could play in Big Data research.
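As an aside on the macro-characteristics mentioned above, the sketch below measures 'small world' properties (high clustering combined with short average paths) on synthetic graphs with networkx; it says nothing about the micro-processes the paper is actually concerned with, and the graphs are not real online-network data.

```python
import networkx as nx

# Synthetic graphs only: a Watts-Strogatz small-world graph versus a random
# graph with the same size, compared on the two classic macro-properties.
n, k, p = 1000, 10, 0.1
small_world = nx.watts_strogatz_graph(n, k, p, seed=42)
random_graph = nx.gnm_random_graph(n, small_world.number_of_edges(), seed=42)

def avg_path_length(g):
    # Guard against disconnected graphs by using the largest component.
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    return nx.average_shortest_path_length(giant)

for name, g in [("Watts-Strogatz", small_world), ("Erdos-Renyi", random_graph)]:
    print(f"{name}: clustering={nx.average_clustering(g):.3f}, "
          f"avg path length={avg_path_length(g):.2f}")
```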

343 citations


Cites background from "An open-source toolkit for mining W..."

  • ...Scientists have begun to develop Web services with interfaces to collectors of Big Data sets, e.g., Milne and Witten (2009) for Wikipedia at http://wikipedia-miner.cms.waikato.ac.nz/ and Reips and Garaizar (2011) for Twitter at http://tweetminer.eu....


Proceedings ArticleDOI
11 Apr 2016
TL;DR: The Primary Sources Tool, which aims to facilitate this and future data migrations, is described, and the ongoing transfer efforts and data mapping challenges are reported.
Abstract: Collaborative knowledge bases that make their data freely available in a machine-readable form are central for the data strategy of many projects and organizations. The two major collaborative knowledge bases are Wikimedia's Wikidata and Google's Freebase. Due to the success of Wikidata, Google decided in 2014 to offer the content of Freebase to the Wikidata community. In this paper, we report on the ongoing transfer efforts and data mapping challenges, and provide an analysis of the effort so far. We describe the Primary Sources Tool, which aims to facilitate this and future data migrations. Throughout the migration, we have gained deep insights into both Wikidata and Freebase, and share and discuss detailed statistics on both knowledge bases.
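As a hedged sketch of what the underlying data mapping involves, the snippet below translates one Freebase-style fact into a Wikidata-style statement; the lookup tables are illustrative stand-ins, not the actual migration mappings or the Primary Sources Tool's format.

```python
# Hand-written lookup tables for one entity and one property, purely for
# illustration; the real migration relies on much larger mappings and on
# human review through the Primary Sources Tool.
MID_TO_QID = {"/m/0282x": "Q42"}                           # Douglas Adams
FB_PROP_TO_PID = {"/people/person/place_of_birth": "P19"}  # place of birth

def freebase_to_wikidata(mid: str, fb_property: str, value: str):
    """Translate one Freebase-style fact into a Wikidata-style statement."""
    qid = MID_TO_QID.get(mid)
    pid = FB_PROP_TO_PID.get(fb_property)
    if qid is None or pid is None:
        # Unmapped identifiers are one face of the data mapping challenge;
        # here they are simply skipped.
        return None
    return {"item": qid, "property": pid, "value": value}

print(freebase_to_wikidata("/m/0282x", "/people/person/place_of_birth", "Cambridge"))
# {'item': 'Q42', 'property': 'P19', 'value': 'Cambridge'}
```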

185 citations


Additional excerpts

  • ...Most of the research is focused on Wikipedia [11], which is understandable considering the availability of its data sets, in particular the whole edit history [27] and the availability of tools for working with Wikipedia [22]....


Journal ArticleDOI
11 Jul 2016-Sensors
TL;DR: A state of the art of IoT from the context-aware perspective is presented, which allows the integration of IoT and social networks under the emerging term Social Internet of Things (SIoT).
Abstract: The Internet of Things (IoT) has made it possible for devices around the world to acquire information and store it, in order to be able to use it at a later stage. However, this potential opportunity is often not exploited because of the excessively big interval between the data collection and the capability to process and analyse it. In this paper, we review the current IoT technologies, approaches and models in order to discover what challenges need to be met to make more sense of data. The main goal of this paper is to review the surveys related to IoT in order to provide well integrated and context aware intelligent services for IoT. Moreover, we present a state-of-the-art of IoT from the context aware perspective that allows the integration of IoT and social networks in the emerging Social Internet of Things (SIoT) term.

180 citations


Cites methods from "An open-source toolkit for mining W..."

  • ...[127,128] R, an open-source programming language and software environment, is designed for data mining/analysis and visualization....


Journal ArticleDOI
TL;DR: The overall picture shows that not only are semi-structured resources enabling a renaissance of knowledge-rich AI techniques, but also that significant advances in high-end applications that require deep understanding capabilities can be achieved by synergistically exploiting large amounts of machine-readable structured knowledge in combination with sound statistical AI and NLP techniques.

176 citations


Cites background or methods from "An open-source toolkit for mining W..."

  • ...We argue that the popularity enjoyed by this line of research is a consequence of the fact that (i) it provides a viable solution to some of AI’s long-lasting problems, crucially including the quest for knowledge [179]; (ii) it has wide applicability spanning many different sub-areas of AI – as shown by the papers found in this special issue, which range from computational neuroscience [154] to information retrieval [82,102], through works in knowledge acquisition [73,130,192] and a variety of NLP applications such as Named Entity Recognition [148], Named Entity disambiguation [67] and computing semantic relatedness [122,216]....


  • ...The Wikipedia Miner toolkit from Milne and Witten [122] makes the supervised wikification system originally presented in [121] freely available, while Tonelli et al. [192] present instead the Wiki Machine, a high-performance wikification system which is shown to outperform Wikipedia Miner thanks to a state-of-the-art kernel-based WSD algorithm [62]....


  • ...Milne and Witten [120] compared a tf*idf-like measure computed on Wikipedia links with a more refined link co-occurrence measure modeled after the Normalized Google Distance [37].... (A sketch of this link-based measure follows these excerpts.)


  • ...Finally, the last two papers present tools for working with semi-structured resources like Wikipedia [122], and its use to acquire computational semantic models of mental representations of concepts [154]....


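One excerpt above contrasts a tf*idf-like measure over Wikipedia links with a link co-occurrence measure modeled after the Normalized Google Distance. A minimal sketch of that NGD-style link relatedness, using toy in-link sets and a rough article count rather than real Wikipedia data:

```python
import math

# Toy in-link sets: which articles link to each target article. Real usage
# would read these from the toolkit's summarized link databases, and |W|
# would be the true article count rather than a rough order of magnitude.
INLINKS = {
    "Dog":  {"Pet", "Mammal", "Wolf", "Domestication", "Canidae"},
    "Wolf": {"Mammal", "Canidae", "Predator", "Domestication"},
    "Moon": {"Astronomy", "Tide", "Apollo 11"},
}
TOTAL_ARTICLES = 3_000_000  # |W|

def relatedness(a: str, b: str) -> float:
    """NGD-style link relatedness: 1 minus the normalised distance
    (log max(|A|,|B|) - log |A ∩ B|) / (log |W| - log min(|A|,|B|))."""
    A, B = INLINKS[a], INLINKS[b]
    common = A & B
    if not common:
        return 0.0
    distance = (
        (math.log(max(len(A), len(B))) - math.log(len(common)))
        / (math.log(TOTAL_ARTICLES) - math.log(min(len(A), len(B))))
    )
    return max(0.0, 1.0 - distance)

print(relatedness("Dog", "Wolf"))   # high: the two articles share many in-links
print(relatedness("Dog", "Moon"))   # 0.0: no shared in-links in the toy data
```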

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
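To make the map/reduce contract concrete, here is a single-process word-count sketch of the programming model only, in plain Python rather than Google's or Hadoop's actual runtime:

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(_doc_id: str, document: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: merge all intermediate values that share a key."""
    return word, sum(counts)

def run_mapreduce(inputs: dict[str, str]) -> dict[str, int]:
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in inputs.items():           # map phase
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce phase

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
print(run_mapreduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2}
```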

20,309 citations


"An open-source toolkit for mining W..." refers background in this paper

  • ...The performance of this file-based database is a bottleneck for many applications....


Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization
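Weka itself is a Java workbench; purely as a stand-in sketch of the workflow the book describes (prepare inputs, build a classifier, evaluate results), the snippet below uses scikit-learn in Python, with a decision tree only loosely comparable to Weka's J48:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Prepare inputs: a small benchmark dataset.
X, y = load_iris(return_X_y=True)

# A simple input transformation followed by a decision-tree learner, loosely
# comparable to Weka's J48 (C4.5); this is scikit-learn, not Weka's API.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

# Evaluate results with 10-fold cross-validation, as the Weka Explorer would.
scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```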

20,196 citations


"An open-source toolkit for mining W..." refers methods in this paper

  • ...It also provides a platform for sharing mining techniques, and for taking advantage of powerful technologies like the distributed computing framework Hadoop [27] and the Weka machine learning workbench [28]....


Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Book
01 Jan 2008
TL;DR: This book discusses generalized estimating equations (GEEs), computed with PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC.
Abstract: ...logistic regression, and it concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEEs and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.
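As a hedged illustration of a GEE analysis of clustered binary data like the one described, the sketch below uses Python's statsmodels as a stand-in for the SAS PROC GENMOD workflow; the simulated data and settings are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

# Simulated clustered binary outcomes: 50 clusters of 10 observations each,
# with a cluster-level random shift. All numbers are invented for the example.
rng = np.random.default_rng(0)
n_clusters, per_cluster = 50, 10
groups = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
cluster_effect = np.repeat(rng.normal(scale=0.5, size=n_clusters), per_cluster)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 * x + cluster_effect))))

# GEE with a logit link and an exchangeable working correlation, the Python
# analogue of the PROC GENMOD analysis mentioned in the abstract.
X = sm.add_constant(x)
model = sm.GEE(y, X, groups=groups,
               family=sm.families.Binomial(),
               cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```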

9,995 citations

Journal ArticleDOI
19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

5,429 citations