
KNIME: the Konstanz Information Miner
Conference or Workshop Item (Accepted Version)

Berthold, M. R., Cebron, N., Dill, F., Di Fatta, G., Gabriel, T. R., Georg, F., Meinl, T., Ohl, P., Sieb, C. and Wiswedel, B. (2006) KNIME: the Konstanz Information Miner. In: Workshop on Multi-Agent Systems and Simulation (MAS&S), 4th Annual Industrial Simulation Conference (ISC), 05-07 June 2006, Palermo, Italy, pp. 58-61. Available at https://centaur.reading.ac.uk/6139/

KNIME: THE KONSTANZ INFORMATION MINER

Michael R. Berthold, Nicolas Cebron, Fabian Dill, Giuseppe Di Fatta*, Thomas R. Gabriel,
Florian Georg, Thorsten Meinl, Peter Ohl, Christoph Sieb and Bernd Wiswedel

Konstanz University, Department of Computer and Information Science
Fach M712, 78457 Konstanz, Germany
E-mail: Michael.Berthold@uni-konstanz.de

*G. Di Fatta is also with ICAR-CNR, National Research Council, Palermo, Italy.
ABSTRACT

The Konstanz Information Miner is a modular environment which enables easy visual assembly and interactive execution of a data pipeline. It is designed as a teaching, research and collaboration platform, which enables easy integration of new algorithms, data manipulation or visualization methods as new modules or nodes. In this paper we describe some of the design aspects of the underlying architecture and briefly sketch how new nodes can be incorporated.
OVERVIEW

Large volumes of data are often generated during simulations, and the need for modular data analysis environments has increased dramatically over the past years. In order to make use of the vast variety of data analysis methods around, it is essential that such an environment is easy and intuitive to use, allows for quick and interactive changes to the analysis and enables the user to visually explore the results. To meet these challenges a data pipelining environment is an appropriate model. It allows the user to visually assemble and adapt the analysis flow from standardized building blocks, at the same time offering an intuitive, graphical way to document what has been done.
Knime, the Konstanz Information Miner, provides such an environment. Figure 1 shows a screenshot of an example analysis flow. In the center, a flow reads in data from three sources and processes it in several, partly parallel analysis flows consisting of preprocessing, modeling, and visualization nodes. On the left a repository of nodes is shown. From this large variety of nodes, one can select data sources, data preprocessing steps, model building algorithms, visualization techniques as well as model I/O tools and drag them onto the workbench, where they can be connected to other nodes. The ability to have all views interact graphically creates a powerful environment to explore the data sets at hand.

Knime is written in Java and its graphical workflow editor is implemented as an Eclipse (Eclipse Foundation 2005) plug-in. It is easy to extend through an open API and a data abstraction framework, which allows new nodes to be quickly added in a well-defined way.

In this paper we will describe some of the internals of Knime in more detail. More information as well as downloads can be found at http://www.knime.org.
ARCHITECTURE

The architecture of Knime was designed with three main principles in mind:

• Visual, interactive framework: data flows should be combined by simple drag&drop from a variety of processing units. Customized applications can be modelled through individual data pipelines.

• Modularity: processing units and data containers should not depend on each other, in order to enable easy distribution of computation and to allow for independent development of different algorithms. Data types are encapsulated, that is, no types are predefined; new types can easily be added, bringing along type-specific renderers and comparators, and new types can be declared compatible to existing types (see the sketch after this list).

• Easy expandability: it should be easy to add new processing nodes or views and to distribute them through a simple plug&play principle, without the need for complicated install/deinstall procedures.
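A rough Java sketch of the type-encapsulation idea follows. It is a hypothetical illustration only: the names Value, DoubleValue and TemperatureCell are invented for this example and are not the actual KNIME classes; the point is merely that a new cell type carries its own rendering and becomes compatible with an existing type by implementing that type's value interface.

```java
// Hypothetical sketch of encapsulated, extensible data types
// (invented names, not the actual KNIME API).
interface Value { }

interface DoubleValue extends Value {
    double getDoubleValue();
}

// A new type brings its own data and rendering, and is declared
// compatible to the existing double type by implementing its interface.
final class TemperatureCell implements DoubleValue {
    private final double kelvin;

    TemperatureCell(double kelvin) {
        this.kelvin = kelvin;
    }

    @Override
    public double getDoubleValue() {
        return kelvin;            // usable wherever a DoubleValue is expected
    }

    @Override
    public String toString() {
        return kelvin + " K";     // type-specific textual rendering
    }
}
```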
In order to achieve this, a data analysis process consists of a pipeline of nodes, connected by edges that transport either data or models. Each node processes the arriving data and/or model(s) and produces results on its outputs. Figure 2 schematically illustrates this process. The type of processing ranges from simple data operations such as filtering or merging, to more complex statistical functions such as computations of mean, standard deviation or linear regression coefficients, to computation-intensive data modeling operators (clustering, decision trees, neural networks, to name just a few). In addition, most of the modeling nodes allow the user to interactively explore their results through accompanying views.
In the following we will briefly describe the underlying schemata of data, nodes and workflow management, and how the interactive views communicate.

Figure 1: An Example Analysis Flow inside Knime.
Data Structures

All data flowing between nodes is wrapped within a class called DataTable, which holds meta-information concerning the type of its columns and the actual data. The data can be accessed by iterating over instances of DataRow. Each row contains a unique identifier (or primary key) and a specific number of DataCell objects which hold the actual data. The reason to avoid access by row ID or index is scalability, that is, the desire to be able to process large amounts of data and therefore not be forced to keep all of the rows in memory for fast, random access. Figure 3 shows a diagram of the main underlying data structure.
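To make the access pattern concrete, the following is a minimal Java sketch of the forward-only row iteration described above. The interfaces mirror the class names from the text (DataTable, DataRow, DataCell), but the exact methods and signatures are simplifying assumptions, not the real KNIME API.

```java
// Simplified sketch of row iteration; modeled on the description above,
// not copied from the actual KNIME classes.
interface DataCell { }

interface DataRow {
    String getKey();                 // unique row identifier (primary key)
    int getNumCells();
    DataCell getCell(int index);
}

interface DataTable extends Iterable<DataRow> { }

class RowScanner {
    // Rows are processed one at a time; there is no random access by index,
    // so the full table never has to be kept in memory.
    static void scan(DataTable table) {
        for (DataRow row : table) {
            for (int i = 0; i < row.getNumCells(); i++) {
                DataCell cell = row.getCell(i);
                // ... process the cell ...
            }
        }
    }
}
```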
Nodes

Nodes in Knime are the most general processing units and usually resemble one visual node in the workflow. The class Node wraps all functionality and makes use of user-defined implementations of a NodeModel, possibly a NodeDialog, and one or more NodeView instances if appropriate. Neither dialog nor view needs to be implemented if no user settings or views are required. This schema follows the well-known Model-View-Controller design pattern. In addition, for the input and output connections, each node has a number of Inport and Outport instances, which can either transport data or model(s). Figure 4 shows a diagram of this structure.
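The wiring can be pictured with the following Java sketch. The class names follow the text, but the constructors and members shown are simplified assumptions made for illustration; they are not the real signatures.

```java
// Illustrative Model-View-Controller wiring of a node (simplified).
abstract class NodeModel {                  // the model: settings and results
    abstract void execute();
}

abstract class NodeDialog { }               // optional: user settings

abstract class NodeView {                   // optional: a view on the model
    final NodeModel model;
    NodeView(NodeModel model) { this.model = model; }
}

class Inport { }                            // transports incoming data or models
class Outport { }                           // transports outgoing data or models

final class Node {                          // the controller: wraps everything
    private final NodeModel model;
    private final NodeDialog dialog;                // may be null
    private final java.util.List<NodeView> views;   // may be empty
    private final Inport[] inports;
    private final Outport[] outports;

    Node(NodeModel model, NodeDialog dialog, java.util.List<NodeView> views,
         int nrInports, int nrOutports) {
        this.model = model;
        this.dialog = dialog;
        this.views = views;
        this.inports = new Inport[nrInports];
        this.outports = new Outport[nrOutports];
    }
}
```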
Workflow Management

Workflows in Knime are essentially graphs connecting nodes or, more formally, directed acyclic graphs (DAGs). The WorkflowManager allows new nodes to be inserted and directed edges (connections) to be added between two nodes. It also keeps track of the status of nodes (configured, executed, ...) and returns, on demand, a pool of executable nodes. This way the surrounding framework can freely distribute the workload among a number of parallel threads or, in the future, even a distributed cluster of servers. Thanks to the underlying graph structure, the workflow manager is able to determine all nodes required to be executed along the paths leading to the node the user actually wants to execute.
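The predecessor computation can be illustrated with a small Java sketch: starting from the node the user wants to execute, the incoming edges of the DAG are followed backwards and every node that has not been executed yet is collected. WorkflowManager is the class named in the text; the graph representation and method names below are assumptions made for this example.

```java
import java.util.*;

// Minimal sketch of determining which nodes must run before a target node.
class WorkflowSketch {
    // incoming.get(n) = all nodes with a directed edge into n
    private final Map<String, List<String>> incoming = new HashMap<>();
    private final Set<String> executed = new HashSet<>();

    void addConnection(String from, String to) {
        incoming.computeIfAbsent(to, k -> new ArrayList<>()).add(from);
    }

    void markExecuted(String node) {
        executed.add(node);
    }

    // All not-yet-executed nodes on paths leading to 'target', including the target.
    Set<String> nodesRequiredFor(String target) {
        Set<String> required = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(target);
        while (!stack.isEmpty()) {
            String n = stack.pop();
            if (executed.contains(n) || !required.add(n)) {
                continue;                    // already executed or already visited
            }
            for (String predecessor : incoming.getOrDefault(n, List.of())) {
                stack.push(predecessor);
            }
        }
        return required;
    }
}
```

The returned set is what the surrounding framework can then schedule, handing nodes whose inputs are already available to parallel worker threads.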

Views and Interactive Brushing

Each Node can have an arbitrary number of views associated with it. By receiving events from a HiLiteHandler (and sending events to it), a view can mark selected points (so-called hiliting) to enable visual brushing. Views can range from simple table views to more complex views on the underlying data or the generated model.
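The hilite mechanism is essentially a publish/subscribe pattern over row identifiers: the view in which the user selects points fires an event, the handler forwards it, and every registered view marks the same rows. HiLiteHandler is the name used in the text; the listener interface and methods in this Java sketch are assumptions made for illustration.

```java
import java.util.*;

// Illustrative publish/subscribe sketch of interactive brushing (hiliting).
interface HiLiteListener {
    void hiLite(Set<String> rowKeys);       // rows to mark in the view
    void unHiLite(Set<String> rowKeys);     // rows to unmark
}

class HiLiteHandlerSketch {
    private final List<HiLiteListener> listeners = new ArrayList<>();
    private final Set<String> hilit = new HashSet<>();

    void addListener(HiLiteListener listener) {
        listeners.add(listener);
    }

    // Called by the view in which the user selected points.
    void fireHiLite(Set<String> rowKeys) {
        hilit.addAll(rowKeys);
        for (HiLiteListener l : listeners) {
            l.hiLite(rowKeys);              // all other views mark the same rows
        }
    }

    void fireUnHiLite(Set<String> rowKeys) {
        hilit.removeAll(rowKeys);
        for (HiLiteListener l : listeners) {
            l.unHiLite(rowKeys);
        }
    }
}
```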
REPOSITORY

Knime already offers a large variety of nodes, among them nodes for various types of data I/O, manipulation and transformation, as well as data mining, machine learning and visualization components:

• data I/O: generic file reader, ARFF and Hitlist file reader, database connector, CSV, Hitlist and ARFF writer

• data manipulation: row and column filtering, data partitioning and sampling, random shuffling or sorting, data joiner and merger

• data transformation: missing value replacer, matrix transposer, binners, nominal value generators

• mining algorithms: clustering (k-means, SOTA, fuzzy c-means), decision tree, (fuzzy) rule induction, regression, subgroup and association rule mining

• machine learning: neural networks (RBF and MLP), support vector machines*, Bayes networks and Bayes classifier*

• statistics: via integrated R

• visualization: scatter plot, histogram, parallel coordinates, multidimensional scaling, rule plotters, line and pie charts

• misc: scripting nodes

(*: via external libraries or tools)
Figure 2: A Schematic for the Flow of Data and Models
in a Knime-workflow.
Figure 3: A Class Diagram of the Data Structure and
the Main Classes it relies on.
Figure 4: A Class Diagram of the Node and the Main
Classes it relies on.
EXTENDING KNIME

Knime already includes plug-ins to incorporate existing data analysis tools, such as Weka (Ian H. Witten and Eibe Frank 2005), the statistical toolkit R (R Development Core Team 2005), and JFreeChart (David Gilbert 2005). It is usually straightforward to create wrappers for external tools without having to modify the executables themselves. Adding new nodes to Knime, also for new native operations, is easy. For this, one needs to extend three abstract classes:

• NodeModel: this class is responsible for the main computations. It requires three main methods to be overwritten: configure(), execute() and reset(). The first takes the meta-information of the input tables and creates the definition of the output specification.

In addition to the model, dialog and view classes, the programmer also needs to provide a NodeFactory, which creates new instances of these classes. A wizard integrated in the Eclipse-based development environment allows all required class bodies for a new node to be generated quickly.
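As an illustration, the skeleton of a new node's model might look as follows. This is a hedged sketch only: the method names configure(), execute() and reset() come from the text, while the parameter and return types are simplified placeholders rather than the real KNIME signatures.

```java
// Simplified skeleton of a custom node model (illustrative placeholders only).
class TableSpec { }   // stands in for the meta-information of a table
class Table { }       // stands in for the actual data table

abstract class NodeModelSketch {
    // Takes the meta-information of the input tables and creates the
    // definition of the output specification, so downstream nodes can be
    // configured before any data flows.
    abstract TableSpec[] configure(TableSpec[] inSpecs) throws Exception;

    // Performs the actual computation on the incoming data.
    abstract Table[] execute(Table[] inData) throws Exception;

    // Discards internal results so the node can be reconfigured and re-executed.
    abstract void reset();
}
```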

REFERENCES

Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques.

KNIME: The Konstanz Information Miner.

The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics.

Parallel and Distributed Data Pipelining with KNIME.