
KNIME: the Konstanz Information Miner
Conference or Workshop Item (Accepted Version)

Berthold, M. R., Cebron, N., Dill, F., Di Fatta, G., Gabriel, T. R., Georg, F., Meinl, T., Ohl, P., Sieb, C. and Wiswedel, B. (2006) KNIME: the Konstanz Information Miner. In: Workshop on Multi-Agent Systems and Simulation (MAS&S), 4th Annual Industrial Simulation Conference (ISC), 05-07 June 2006, Palermo, Italy, pp. 58-61. Available at https://centaur.reading.ac.uk/6139/

KNIME: THE KONSTANZ INFORMATION MINER

Michael R. Berthold, Nicolas Cebron, Fabian Dill, Giuseppe Di Fatta*, Thomas R. Gabriel,
Florian Georg, Thorsten Meinl, Peter Ohl, Christoph Sieb and Bernd Wiswedel

Konstanz University, Department of Computer and Information Science
Fach M712, 78457 Konstanz, Germany
E-mail: Michael.Berthold@uni-konstanz.de

*G. Di Fatta is also with ICAR-CNR, National Research Council, Palermo, Italy.
ABSTRACT

The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.
OVERVIEW

Large volumes of data are often generated during simulations, and the need for modular data analysis environments has increased dramatically over the past years. In order to make use of the vast variety of data analysis methods around, it is essential that such an environment is easy and intuitive to use, allows for quick and interactive changes to the analysis and enables the user to visually explore the results. To meet these challenges a data pipelining environment is an appropriate model. It allows the user to visually assemble and adapt the analysis flow from standardized building blocks, at the same time offering an intuitive, graphical way to document what has been done.
Knime, the Konstanz Information Miner, provides such an environment. Figure 1 shows a screenshot of an example analysis flow. In the center, a flow reads in data from three sources and processes it in several, partly parallel analysis flows consisting of preprocessing, modeling, and visualization nodes. On the left a repository of nodes is shown. From this large variety of nodes, one can select data sources, data preprocessing steps, model building algorithms, visualization techniques as well as model I/O tools and drag them onto the workbench, where they can be connected to other nodes. The ability to have all views interact graphically creates a powerful environment to explore the data sets at hand.

Knime is written in Java and its graphical workflow editor is implemented as an Eclipse (Eclipse Foundation 2005) plug-in. It is easy to extend through an open API and a data abstraction framework, which allows new nodes to be quickly added in a well-defined way.

In this paper we will describe some of the internals of Knime in more detail. More information as well as downloads can be found at http://www.knime.org.
ARCHITECTURE

The architecture of Knime was designed with three main principles in mind:

• Visual, interactive framework: data flows should be combined by simple drag&drop from a variety of processing units. Customized applications can be modelled through individual data pipelines.

• Modularity: processing units and data containers should not depend on each other, in order to enable easy distribution of computation and to allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined; new types can easily be added, bringing along type-specific renderers and comparators, and new types can be declared compatible to existing types (see the sketch after this list).

• Easy expandability: it should be easy to add new processing nodes or views and to distribute them through a simple plug&play principle, without the need for complicated install/deinstall procedures.
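A rough Java sketch of the type-encapsulation idea follows. It is a hypothetical illustration only: the names Value, DoubleValue and TemperatureCell are invented for this example and are not the actual KNIME classes; the point is merely that a new cell type carries its own rendering and becomes compatible with an existing type by implementing that type's value interface.

```java
// Hypothetical sketch of encapsulated, extensible data types
// (invented names, not the actual KNIME API).
interface Value { }

interface DoubleValue extends Value {
    double getDoubleValue();
}

// A new type brings its own data and rendering, and is declared
// compatible to the existing double type by implementing its interface.
final class TemperatureCell implements DoubleValue {
    private final double kelvin;

    TemperatureCell(double kelvin) {
        this.kelvin = kelvin;
    }

    @Override
    public double getDoubleValue() {
        return kelvin;            // usable wherever a DoubleValue is expected
    }

    @Override
    public String toString() {
        return kelvin + " K";     // type-specific textual rendering
    }
}
```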
In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models. Each node processes the arriving data and/or model(s) and produces results on its outputs. Figure 2 schematically illustrates this process. The type of processing ranges from simple data operations such as filtering or merging, to more complex statistical functions such as computations of mean, standard deviation or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow the user to interactively explore their results through accompanying views.
In the following we will briefly describe the underlying schemata of data, nodes and workflow management, and how the interactive views communicate.

Figure 1: An Example Analysis Flow inside Knime.
Data Structures

All data flowing between nodes is wrapped within a class called DataTable, which holds meta-information concerning the type of its columns and the actual data. The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key) and a specific number of DataCell objects which hold the actual data. The reason to avoid access by row ID or index is scalability, that is, the desire to be able to process large amounts of data and therefore not be forced to keep all of the rows in memory for fast, random access. Figure 3 shows a diagram of the main underlying data structure.
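To make the access pattern concrete, the following is a minimal Java sketch of the forward-only row iteration described above. The interfaces mirror the class names from the text (DataTable, DataRow, DataCell), but the exact methods and signatures are simplifying assumptions, not the real KNIME API.

```java
// Simplified sketch of row iteration; modeled on the description above,
// not copied from the actual KNIME classes.
interface DataCell { }

interface DataRow {
    String getKey();                 // unique row identifier (primary key)
    int getNumCells();
    DataCell getCell(int index);
}

interface DataTable extends Iterable<DataRow> { }

class RowScanner {
    // Rows are processed one at a time; there is no random access by index,
    // so the full table never has to be kept in memory.
    static void scan(DataTable table) {
        for (DataRow row : table) {
            for (int i = 0; i < row.getNumCells(); i++) {
                DataCell cell = row.getCell(i);
                // ... process the cell ...
            }
        }
    }
}
```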
Nodes

Nodes in Knime are the most general processing units and usually resemble one visual node in the workflow. The class Node wraps all functionality and makes use of user-defined implementations of a NodeModel, possibly a NodeDialog, and one or more NodeView instances if appropriate. Neither dialog nor view needs to be implemented if no user settings or views are required. This schema follows the well-known Model-View-Controller design pattern. In addition, for the input and output connections, each node has a number of Inport and Outport instances, which can either transport data or model(s). Figure 4 shows a diagram of this structure.
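The wiring can be pictured with the following Java sketch. The class names follow the text, but the constructors and members shown are simplified assumptions made for illustration; they are not the real signatures.

```java
// Illustrative Model-View-Controller wiring of a node (simplified).
abstract class NodeModel {                  // the model: settings and results
    abstract void execute();
}

abstract class NodeDialog { }               // optional: user settings

abstract class NodeView {                   // optional: a view on the model
    final NodeModel model;
    NodeView(NodeModel model) { this.model = model; }
}

class Inport { }                            // transports incoming data or models
class Outport { }                           // transports outgoing data or models

final class Node {                          // the controller: wraps everything
    private final NodeModel model;
    private final NodeDialog dialog;                // may be null
    private final java.util.List<NodeView> views;   // may be empty
    private final Inport[] inports;
    private final Outport[] outports;

    Node(NodeModel model, NodeDialog dialog, java.util.List<NodeView> views,
         int nrInports, int nrOutports) {
        this.model = model;
        this.dialog = dialog;
        this.views = views;
        this.inports = new Inport[nrInports];
        this.outports = new Outport[nrOutports];
    }
}
```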
Workflow Management

Workflows in Knime are essentially graphs connecting nodes or, more formally, directed acyclic graphs (DAGs). The WorkflowManager allows new nodes to be inserted and directed edges (connections) to be added between two nodes. It also keeps track of the status of nodes (configured, executed, ...) and returns, on demand, a pool of executable nodes. This way the surrounding framework can freely distribute the workload among a number of parallel threads or, in the future, even a distributed cluster of servers. Thanks to the underlying graph structure, the workflow manager is able to determine all nodes required to be executed along the paths leading to the node the user actually wants to execute.
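The predecessor computation can be illustrated with a small Java sketch: starting from the node the user wants to execute, the incoming edges of the DAG are followed backwards and every node that has not been executed yet is collected. WorkflowManager is the class named in the text; the graph representation and method names below are assumptions made for this example.

```java
import java.util.*;

// Minimal sketch of determining which nodes must run before a target node.
class WorkflowSketch {
    // incoming.get(n) = all nodes with a directed edge into n
    private final Map<String, List<String>> incoming = new HashMap<>();
    private final Set<String> executed = new HashSet<>();

    void addConnection(String from, String to) {
        incoming.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    void markExecuted(String node) {
        executed.add(node);
    }

    // All not-yet-executed nodes on paths leading to 'target', including the target.
    Set<String> nodesRequiredFor(String target) {
        Set<String> required = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(target);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (executed.contains(n) || !required.add(n)) {
                continue;                    // already executed or already visited
            }
            for (String predecessor : incoming.getOrDefault(n, List.of())) {
                stack.push(predecessor);
            }
        }
        return required;
    }
}
```

The returned set is what the surrounding framework can then schedule, handing nodes whose inputs are already available to parallel worker threads.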

Views and Interactive Brushing

Each Node can have an arbitrary number of views associated with it. By receiving events from a HiLiteHandler (and sending events to it), a view can mark selected points (so-called hiliting) to enable visual brushing. Views can range from simple table views to more complex views on the underlying data or the generated model.
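The hilite mechanism is essentially a publish/subscribe pattern over row identifiers: the view in which the user selects points fires an event, the handler forwards it, and every registered view marks the same rows. HiLiteHandler is the name used in the text; the listener interface and methods in this Java sketch are assumptions made for illustration.

```java
import java.util.*;

// Illustrative publish/subscribe sketch of interactive brushing (hiliting).
interface HiLiteListener {
    void hiLite(Set<String> rowKeys);       // rows to mark in the view
    void unHiLite(Set<String> rowKeys);     // rows to unmark
}

class HiLiteHandlerSketch {
    private final List<HiLiteListener> listeners = new ArrayList<>();
    private final Set<String> hilit = new HashSet<>();

    void addListener(HiLiteListener listener) {
        listeners.add(listener);
    }

    // Called by the view in which the user selected points.
    void fireHiLite(Set<String> rowKeys) {
        hilit.addAll(rowKeys);
        for (HiLiteListener l : listeners) {
            l.hiLite(rowKeys);              // all other views mark the same rows
        }
    }

    void fireUnHiLite(Set<String> rowKeys) {
        hilit.removeAll(rowKeys);
        for (HiLiteListener l : listeners) {
            l.unHiLite(rowKeys);
        }
    }
}
```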
REPOSITORY

Knime already offers a large variety of nodes, among them nodes for various types of data I/O, manipulation and transformation, as well as data mining, machine learning and visualization components:

• data I/O: generic file reader, ARFF and Hitlist file reader, database connector, CSV, Hitlist and ARFF writer

• data manipulation: row and column filtering, data partitioning and sampling, random shuffling or sorting, data joiner and merger

• data transformation: missing value replacer, matrix transposer, binners, nominal value generators

• mining algorithms: clustering (k-means, SOTA, fuzzy c-means), decision tree, (fuzzy) rule induction, regression, subgroup and association rule mining

• machine learning: neural networks (RBF and MLP), support vector machines*, Bayes networks and Bayes classifier*

• statistics: via integrated R

• visualization: scatter plot, histogram, parallel coordinates, multidimensional scaling, rule plotters, line and pie charts

• misc: scripting nodes

(*: via external libraries or tools)
Figure 2: A Schematic for the Flow of Data and Models
in a Knime-workflow.
Figure 3: A Class Diagram of the Data Structure and
the Main Classes it relies on.
Figure 4: A Class Diagram of the Node and the Main
Classes it relies on.
EXTENDING KNIME

Knime already includes plug-ins to incorporate existing data analysis tools, such as Weka (Ian H. Witten and Eibe Frank 2005), the statistical toolkit R (R Development Core Team 2005), and JFreeChart (David Gilbert 2005). It is usually straightforward to create wrappers for external tools without having to modify the executables themselves. Adding new nodes to Knime, also for new native operations, is easy. For this, one needs to extend three abstract classes:

• NodeModel: this class is responsible for the main computations. It requires three main methods to be overwritten: configure(), execute() and reset(). The first takes the meta-information of the input tables and creates the definition of the output specification.

In addition to the model, dialog and view classes, the programmer also needs to provide a NodeFactory, which creates new instances of these classes. A wizard integrated in the Eclipse-based development environment allows all required class bodies for a new node to be generated quickly.
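As an illustration, the skeleton of a new node's model might look as follows. This is a hedged sketch only: the method names configure(), execute() and reset() come from the text, while the parameter and return types are simplified placeholders rather than the real KNIME signatures.

```java
// Simplified skeleton of a custom node model (illustrative placeholders only).
class TableSpec { }   // stands in for the meta-information of a table
class Table { }       // stands in for the actual data table

abstract class NodeModelSketch {
    // Takes the meta-information of the input tables and creates the
    // definition of the output specification, so downstream nodes can be
    // configured before any data flows.
    abstract TableSpec[] configure(TableSpec[] inSpecs) throws Exception;

    // Performs the actual computation on the incoming data.
    abstract Table[] execute(Table[] inData) throws Exception;

    // Discards internal results so the node can be reconfigured and re-executed.
    abstract void reset();
}
```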

REFERENCES

Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques.

KNIME: The Konstanz Information Miner.

The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics.

Parallel and Distributed Data Pipelining with KNIME.