Journal ArticleDOI

An open-source toolkit for mining Wikipedia

01 Jan 2013-Artificial Intelligence (Elsevier)-Vol. 194, pp 222-239
TL;DR: The Wikipedia Miner toolkit is introduced: an open-source software system that allows researchers and developers to integrate Wikipedia's rich semantics into their own applications and that creates databases containing summarized versions of Wikipedia's content and structure.
About: This article is published in Artificial Intelligence. The article was published on 2013-01-01 and is currently open access. It has received 382 citations to date. The article focuses on the topics: Semantic similarity & Document Structure Description.
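The citation excerpts further down this page note that the toolkit was also offered as a public web service at http://wikipedia-miner.cms.waikato.ac.nz/. Purely as a hedged sketch of what calling such a service from an application might look like (the endpoint path and parameter names below are assumptions, not taken from the paper, and the deployment is likely offline):

```python
import urllib.parse
import urllib.request

# Hypothetical sketch only: the host is the one named in a citation excerpt
# below, but the service path ("services/compare") and parameter names
# ("term1", "term2") are assumptions, and the public deployment may no
# longer be reachable.
BASE_URL = "http://wikipedia-miner.cms.waikato.ac.nz/services/compare"

def compare(term1: str, term2: str) -> str | None:
    """Ask the (assumed) relatedness service how related two terms are."""
    query = urllib.parse.urlencode({"term1": term1, "term2": term2})
    try:
        with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=10) as resp:
            return resp.read().decode("utf-8")  # raw response body
    except OSError as err:  # covers URLError, timeouts, DNS failures
        print(f"Service unavailable: {err}")
        return None

print(compare("artificial intelligence", "machine learning"))
```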
Citations
Journal ArticleDOI
TL;DR: A thorough overview and analysis of the main approaches to entity linking is presented, and various applications, the evaluation of entity linking systems, and future directions are discussed.
Abstract: The large number of potential applications from bridging web data with knowledge bases have led to an increase in the entity linking research. Entity linking is the task to link entity mentions in text with their corresponding entities in a knowledge base. Potential applications include information extraction, information retrieval, and knowledge base population. However, this task is challenging due to name variations and entity ambiguity. In this survey, we present a thorough overview and analysis of the main approaches to entity linking, and discuss various applications, the evaluation of entity linking systems, and future directions.
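As a minimal illustration of the task described above, the sketch below handles name variation with an alias dictionary and entity ambiguity with a prior plus crude context overlap; the tiny knowledge base and scores are invented for the example and do not come from any system covered by the survey.

```python
# Toy knowledge base and deliberately naive scoring, purely for illustration;
# the entities, aliases, priors and descriptions below are invented.
KB = {
    "Michael Jordan (basketball)": {
        "aliases": {"michael jordan", "mj"},
        "prior": 0.7,  # how often the name refers to this entity in training text
        "description": "American basketball player who won six NBA championships",
    },
    "Michael I. Jordan (scientist)": {
        "aliases": {"michael jordan", "michael i. jordan"},
        "prior": 0.3,
        "description": "American researcher in machine learning and statistics",
    },
}

def link(mention: str, context: str) -> str | None:
    """Resolve a mention to a KB entity: alias lookup + prior + context overlap."""
    mention, ctx = mention.lower(), set(context.lower().split())
    best, best_score = None, 0.0
    for entity, info in KB.items():
        if mention not in info["aliases"]:
            continue  # candidate generation: the alias dictionary handles name variation
        overlap = len(ctx & set(info["description"].lower().split()))
        score = info["prior"] + overlap  # naive combination of prior and context evidence
        if score > best_score:
            best, best_score = entity, score
    return best

print(link("Michael Jordan", "a pioneer of machine learning and Bayesian statistics"))
# -> 'Michael I. Jordan (scientist)'
```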

702 citations

01 Jan 2012
TL;DR: This work proposes, in addition, a different strategy that yields knowledge about micro-processes matching actual online behavior, which can then be used to select mathematically tractable models of online network formation and evolution.
Abstract: Research on so-called ‘Big Data’ has received a considerable momentum and is expected to grow in the future. One very interesting stream of research on Big Data analyzes online networks. Many online networks are known to have some typical macro-characteristics, such as ‘small world’ properties. Much less is known about underlying micro-processes leading to these properties. The models used by Big Data researchers usually are inspired by mathematical ease of exposition. We propose to follow in addition a different strategy that leads to knowledge about micro-processes that match with actual online behavior. This knowledge can then be used for the selection of mathematically-tractable models of online network formation and evolution. Insight from social and behavioral research is needed for pursuing this strategy of knowledge generation about micro-processes. Accordingly, our proposal points to a unique role that social scientists could play in Big Data research.
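As an aside on the macro-characteristics mentioned above, the sketch below measures 'small world' properties (high clustering combined with short average paths) on synthetic graphs with networkx; it says nothing about the micro-processes the paper is actually concerned with, and the graphs are not real online-network data.

```python
import networkx as nx

# Synthetic graphs only: a Watts-Strogatz small-world graph versus a random
# graph with the same size, compared on the two classic macro-properties.
n, k, p = 1000, 10, 0.1
small_world = nx.watts_strogatz_graph(n, k, p, seed=42)
random_graph = nx.gnm_random_graph(n, small_world.number_of_edges(), seed=42)

def avg_path_length(g):
    # Guard against disconnected graphs by using the largest component.
    giant = g.subgraph(max(nx.connected_components(g), key=len))
    return nx.average_shortest_path_length(giant)

for name, g in [("Watts-Strogatz", small_world), ("Erdos-Renyi", random_graph)]:
    print(f"{name}: clustering={nx.average_clustering(g):.3f}, "
          f"avg path length={avg_path_length(g):.2f}")
```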

343 citations


Cites background from "An open-source toolkit for mining W..."

  • ...Scientists have begun to develop Web services with interfaces to collectors of Big Data sets, e.g., Milne and Witten (2009) for Wikipedia at http://wikipedia-miner.cms.waikato.ac.nz/ and Reips and Garaizar (2011) for Twitter at http://tweetminer.eu....


Proceedings ArticleDOI
11 Apr 2016
TL;DR: The Primary Sources Tool, which aims to facilitate this and future data migrations, is described, and the ongoing transfer efforts and data mapping challenges are reported.
Abstract: Collaborative knowledge bases that make their data freely available in a machine-readable form are central for the data strategy of many projects and organizations. The two major collaborative knowledge bases are Wikimedia's Wikidata and Google's Freebase. Due to the success of Wikidata, Google decided in 2014 to offer the content of Freebase to the Wikidata community. In this paper, we report on the ongoing transfer efforts and data mapping challenges, and provide an analysis of the effort so far. We describe the Primary Sources Tool, which aims to facilitate this and future data migrations. Throughout the migration, we have gained deep insights into both Wikidata and Freebase, and share and discuss detailed statistics on both knowledge bases.
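As a hedged sketch of what the underlying data mapping involves, the snippet below translates one Freebase-style fact into a Wikidata-style statement; the lookup tables are illustrative stand-ins, not the actual migration mappings or the Primary Sources Tool's format.

```python
# Hand-written lookup tables for one entity and one property, purely for
# illustration; the real migration relies on much larger mappings and on
# human review through the Primary Sources Tool.
MID_TO_QID = {"/m/0282x": "Q42"}                           # Douglas Adams
FB_PROP_TO_PID = {"/people/person/place_of_birth": "P19"}  # place of birth

def freebase_to_wikidata(mid: str, fb_property: str, value: str):
    """Translate one Freebase-style fact into a Wikidata-style statement."""
    qid = MID_TO_QID.get(mid)
    pid = FB_PROP_TO_PID.get(fb_property)
    if qid is None or pid is None:
        # Unmapped identifiers are one face of the data mapping challenge;
        # here they are simply skipped.
        return None
    return {"item": qid, "property": pid, "value": value}

print(freebase_to_wikidata("/m/0282x", "/people/person/place_of_birth", "Cambridge"))
# {'item': 'Q42', 'property': 'P19', 'value': 'Cambridge'}
```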

185 citations


Additional excerpts

  • ...Most of the research is focused on Wikipedia [11], which is understandable considering the availability of its data sets, in particular the whole edit history [27] and the availability of tools for working with Wikipedia [22]....


Journal ArticleDOI
11 Jul 2016-Sensors
TL;DR: A state of the art of IoT from the context-aware perspective is presented, which allows the integration of IoT and social networks under the emerging term Social Internet of Things (SIoT).
Abstract: The Internet of Things (IoT) has made it possible for devices around the world to acquire information and store it, in order to be able to use it at a later stage. However, this potential opportunity is often not exploited because of the excessively big interval between the data collection and the capability to process and analyse it. In this paper, we review the current IoT technologies, approaches and models in order to discover what challenges need to be met to make more sense of data. The main goal of this paper is to review the surveys related to IoT in order to provide well integrated and context aware intelligent services for IoT. Moreover, we present a state-of-the-art of IoT from the context aware perspective that allows the integration of IoT and social networks in the emerging Social Internet of Things (SIoT) term.

180 citations


Cites methods from "An open-source toolkit for mining W..."

  • ...[127,128] R, an open-source programming language and software environment, is designed for data mining/analysis and visualization....


Journal ArticleDOI
TL;DR: The overall picture shows that not only are semi-structured resources enabling a renaissance of knowledge-rich AI techniques, but also that significant advances in high-end applications that require deep understanding capabilities can be achieved by synergistically exploiting large amounts of machine-readable structured knowledge in combination with sound statistical AI and NLP techniques.

176 citations


Cites background or methods from "An open-source toolkit for mining W..."

  • ...We argue that the popularity enjoyed by this line of research is a consequence of the fact that (i) it provides a viable solution to some of AI’s long-lasting problems, crucially including the quest for knowledge [179]; (ii) it has wide applicability spanning many different sub-areas of AI – as shown by the papers found in this special issue, which range from computational neuroscience [154] to information retrieval [82,102], through works in knowledge acquisition [73,130,192] and a variety of NLP applications such as Named Entity Recognition [148], Named Entity disambiguation [67] and computing semantic relatedness [122,216]....


  • ...The Wikipedia Miner toolkit from Milne and Witten [122] makes the supervised wikification system originally presented in [121] freely available, while Tonelli et al. [192] present instead the Wiki Machine, a high-performance wikification system which is shown to outperform Wikipedia Miner thanks to a state-of-the-art kernel-based WSD algorithm [62]....


  • ...Milne and Witten [120] compared a tf*idf-like measure computed on Wikipedia links with a more refined link co-occurrence measure modeled after the Normalized Google Distance [37].... (A sketch of this link-based measure follows these excerpts.)


  • ...Finally, the last two papers present tools for working with semi-structured resources like Wikipedia [122], and its use to acquire computational semantic models of mental representations of concepts [154]....


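One excerpt above contrasts a tf*idf-like measure over Wikipedia links with a link co-occurrence measure modeled after the Normalized Google Distance. A minimal sketch of that NGD-style link relatedness, using toy in-link sets and a rough article count rather than real Wikipedia data:

```python
import math

# Toy in-link sets: which articles link to each target article. Real usage
# would read these from the toolkit's summarized link databases, and |W|
# would be the true article count rather than a rough order of magnitude.
INLINKS = {
    "Dog":  {"Pet", "Mammal", "Wolf", "Domestication", "Canidae"},
    "Wolf": {"Mammal", "Canidae", "Predator", "Domestication"},
    "Moon": {"Astronomy", "Tide", "Apollo 11"},
}
TOTAL_ARTICLES = 3_000_000  # |W|

def relatedness(a: str, b: str) -> float:
    """NGD-style link relatedness: 1 minus the normalised distance
    (log max(|A|,|B|) - log |A ∩ B|) / (log |W| - log min(|A|,|B|))."""
    A, B = INLINKS[a], INLINKS[b]
    common = A & B
    if not common:
        return 0.0
    distance = (
        (math.log(max(len(A), len(B))) - math.log(len(common)))
        / (math.log(TOTAL_ARTICLES) - math.log(min(len(A), len(B))))
    )
    return max(0.0, 1.0 - distance)

print(relatedness("Dog", "Wolf"))   # high: the two articles share many in-links
print(relatedness("Dog", "Moon"))   # 0.0: no shared in-links in the toy data
```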

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents MapReduce, a programming model and associated implementation for processing and generating large data sets, which runs on large clusters of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
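To make the map/reduce contract concrete, here is a single-process word-count sketch of the programming model only, in plain Python rather than Google's or Hadoop's actual runtime:

```python
from collections import defaultdict
from typing import Iterable, Iterator

def map_fn(_doc_id: str, document: str) -> Iterator[tuple[str, int]]:
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """Reduce: merge all intermediate values that share a key."""
    return word, sum(counts)

def run_mapreduce(inputs: dict[str, str]) -> dict[str, int]:
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in inputs.items():           # map phase
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)  # shuffle: group by key
    return dict(reduce_fn(k, v) for k, v in grouped.items())  # reduce phase

docs = {"doc1": "the quick brown fox", "doc2": "the lazy dog and the fox"}
print(run_mapreduce(docs))  # {'the': 3, 'quick': 1, ..., 'fox': 2}
```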

20,309 citations


"An open-source toolkit for mining W..." refers background in this paper

  • ...The performance of this file-based database is a bottleneck for many applications....


Book
25 Oct 1999
TL;DR: This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining.
Abstract: Data Mining: Practical Machine Learning Tools and Techniques offers a thorough grounding in machine learning concepts as well as practical advice on applying machine learning tools and techniques in real-world data mining situations. This highly anticipated third edition of the most acclaimed work on data mining and machine learning will teach you everything you need to know about preparing inputs, interpreting outputs, evaluating results, and the algorithmic methods at the heart of successful data mining. Thorough updates reflect the technical changes and modernizations that have taken place in the field since the last edition, including new material on Data Transformations, Ensemble Learning, Massive Data Sets, Multi-instance Learning, plus a new version of the popular Weka machine learning software developed by the authors. Witten, Frank, and Hall include both tried-and-true techniques of today as well as methods at the leading edge of contemporary research. *Provides a thorough grounding in machine learning concepts as well as practical advice on applying the tools and techniques to your data mining projects *Offers concrete tips and techniques for performance improvement that work by transforming the input or output in machine learning methods *Includes downloadable Weka software toolkit, a collection of machine learning algorithms for data mining tasks-in an updated, interactive interface. Algorithms in toolkit cover: data pre-processing, classification, regression, clustering, association rules, visualization
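Weka itself is a Java workbench; purely as a stand-in sketch of the workflow the book describes (prepare inputs, build a classifier, evaluate results), the snippet below uses scikit-learn in Python, with a decision tree only loosely comparable to Weka's J48:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Prepare inputs: a small benchmark dataset.
X, y = load_iris(return_X_y=True)

# A simple input transformation followed by a decision-tree learner, loosely
# comparable to Weka's J48 (C4.5); this is scikit-learn, not Weka's API.
model = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=1))

# Evaluate results with 10-fold cross-validation, as the Weka Explorer would.
scores = cross_val_score(model, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```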

20,196 citations


"An open-source toolkit for mining W..." refers methods in this paper

  • ...It also provides a platform for sharing mining techniques, and for taking advantage of powerful technologies like the distributed computing framework Hadoop [27] and the Weka machine learning workbench [28]....


Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Book
01 Jan 2008
TL;DR: This book discusses generalized estimating equations (GEEs), computed with PROC GENMOD in SAS, and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC.
Abstract: ...logistic regression, and it concerns studying the effect of covariates on the risk of disease. The chapter includes generalized estimating equations (GEEs) with computing using PROC GENMOD in SAS and multilevel analysis of clustered binary data using generalized linear mixed-effects models with PROC LOGISTIC. As a prelude to the following chapter on repeated-measures data, Chapter 5 presents time series analysis. The material on repeated-measures analysis uses linear additive models with GEEs and PROC MIXED in SAS for linear mixed-effects models. Chapter 7 is about survival data analysis. All computing throughout the book is done using SAS procedures.
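As a hedged illustration of a GEE analysis of clustered binary data like the one described, the sketch below uses Python's statsmodels as a stand-in for the SAS PROC GENMOD workflow; the simulated data and settings are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

# Simulated clustered binary outcomes: 50 clusters of 10 observations each,
# with a cluster-level random shift. All numbers are invented for the example.
rng = np.random.default_rng(0)
n_clusters, per_cluster = 50, 10
groups = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
cluster_effect = np.repeat(rng.normal(scale=0.5, size=n_clusters), per_cluster)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 * x + cluster_effect))))

# GEE with a logit link and an exchangeable working correlation, the Python
# analogue of the PROC GENMOD analysis mentioned in the abstract.
X = sm.add_constant(x)
model = sm.GEE(y, X, groups=groups,
               family=sm.families.Binomial(),
               cov_struct=sm.cov_struct.Exchangeable())
print(model.fit().summary())
```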

9,995 citations

Journal ArticleDOI
19 Oct 2003
TL;DR: This paper presents file system interface extensions designed to support distributed applications, discusses many aspects of the design, and reports measurements from both micro-benchmarks and real world use.
Abstract: We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of the same goals as previous distributed file systems, our design has been driven by observations of our application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led us to reexamine traditional choices and explore radically different design points. The file system has successfully met our storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by our service as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients. In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

5,429 citations