Proceedings ArticleDOI

Cloud computing for big data analytics in the Process Control Industry

01 Jul 2017 - pp. 1373-1378
TL;DR: The aim of this article is to present an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry. The developments have been carried out in close relationship with the process industry and pave the way for a generalized application of cloud-based approaches, towards the future of Industry 4.0.
Abstract: The aim of this article is to present an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry. Latest innovations in the field of Process Analyzer Techniques (PAT), big data and wireless technologies have created a new environment in which almost all stages of the industrial process can be recorded and utilized, not only for safety, but also for real time optimization. Based on analysis of historical sensor data, machine learning based optimization models can be developed and deployed in real time closed control loops. However, the local implementation of those systems still requires a huge investment in hardware and software, as a direct result of the big data nature of sensor data being recorded continuously. The current technological advancements in cloud computing for big data processing open new opportunities for the industry, while acting as an enabler for a significant reduction in costs, making the technology available to plants of all sizes. The main contribution of this article stems from the presentation, for the first time ever, of a pilot cloud based architecture for the application of a data driven modeling and optimal control configuration for the field of Process Control. As it will be presented, these developments have been carried out in close relationship with the process industry and pave the way for a generalized application of the cloud based approaches, towards the future of Industry 4.0.

Summary (3 min read)

I. INTRODUCTION

  • For many years SCADA systems have been used to collect sensor data in order to control industrial processes, usually in real time [1].
  • The overall problem becomes more complex because of the diversity of the acquired data, mainly due to: different data and sensor types, data reliability levels, measurement frequencies and missing data.
  • With the explosion of the "Internet of Things" [4] in the last decade, a world of new technologies has become readily accessible and relevant for the industrial process.
  • Cloud computing encompasses cloud storage, and batch and streaming analysis of data using the latest Machine Learning (ML) algorithms.
  • Based on such an architecture it will be for the first time feasible to acquire and process online huge streams of data, improve the process models and correspondingly perform an online reconfiguration or re-tuning of the control scheme, in order to meet the changing demands of the process under investigation and apply plant-wide control techniques (see [5], [6]).

II. ARCHITECTURE FOR CLOUD COMPUTING

  • In batch computing, data is first stored in a Big Data Repository where it can be properly cleaned, aggregated or transformed before being analyzed by the process managers (see [7]).
  • Often this includes saving the data in the Parquet format, which can reduce the size of the data by up to 90%.
  • All users were encouraged to contribute their raw batch data to the S3 repository.
  • From the S3 storage service it is feasible to collect the data onto virtual computers ("instances") implemented over the EC2 Amazon elastic computing framework, for data analysis and cleaning.
  • On these virtual computers the Hadoop cluster [8] has been installed with a Spark engine [9] for computing and an RStudio Server [10] as an analytic access point for the end-users.

B. Hadoop Cluster (HDFS)

  • Apache Hadoop is the leading open-source software framework for distributed storage and processing of Big Data [8].
  • While Hadoop encompasses a suite of Apache software programs that help manage the tasks on the distributed system, the two core components of Hadoop are: 1) Hadoop Distributed File System (HDFS) -The system that takes very large data, breaks it down into separate pieces and distributes them to different nodes in a cluster.
  • 2) MapReduce - The computational engine that can perform analysis on the cluster.
  • HDFS was designed to store Big Data with a very high reliability and flexibility to scale up by simply adding commodity servers.
  • In the presented prototype architecture, Hadoop has been utilized as the framework for setting up the HDFS cluster on which the sensor data are stored.

C. Apache Spark Engine

  • The main feature of Apache Spark is its in-memory cluster computing, which makes processing much faster than Hadoop's MapReduce technology.
  • Spark uses HDFS for storage purpose, where calculations are performed in memory on each of the nodes.

D. Process Managers

  • At the other end of the proposed architecture are the process managers who, through local computers, can access and perform machine learning algorithms on the data stored in the Hadoop cluster.
  • The two leading programs that serve as an interface for conducting statistical analysis using the Spark engine are: 1) R - An open-source statistical language used widely both in industry and in academia.
  • 2) Python -An open-source all around language which has a vast library of functions for implementing machine learning algorithms.
  • As mentioned above, both of these coding languages have APIs that pass commands to the Spark engine.
  • The process managers access and run these programs through a number of web-based development environments and notebooks such as the Jupyter notebook, which is popular in the Python community and RStudio, which is the leading IDE amongst R users.

E. Control Feedback Loop

  • After the process managers have performed their analysis, they can set up dynamic models for implementation in the cloud that can push back responses to the industrial processes.
  • This process is explained further in the Near Real-Time Computing subsection.

F. Historical Big Data Repository

  • In the cloud, the raw data and the process manager's recommendations will be stored at the historical big data repository (AWS S3).
  • AWS offers great flexibility in storage plans that have the merit to be easily scaled as needed.

G. Near Real-time Computing

  • Apache Kafka [12] is a publish-subscribe messaging application that enables sending and receiving streaming information between the plants and the Spark engine on the cloud.
  • On the local computers (in the plants) a Kafka API (which consists of a few Java libraries) sends streaming data to a Kafka Server set up on AWS that manages the queue of information passed on to the Spark engine.
  • The Spark engine then performs the streaming analysis and pushes back the results to the Kafka server and from there back to the plants.

H. Batch Computing

  • In batch computing, the data are initially stored in the Historical Big Data Repository where they can be properly cleaned, aggregated or transformed before being analyzed by the process managers.
  • In many cases, this step includes saving the data in the Parquet format, which can significantly reduce its size, using the R or Python languages.
  • In general, the process managers can choose from a vast array of ML algorithms that can be implemented on the cluster through the Spark engine.

III. THE USE CASE OF THE WALKING BEAM FURNACE

  • The walking beam furnace is used to re-heat slabs (large steel beams) to a specific temperature before their refinement in the steel industry (see [13]).
  • In the described use case the main variables that need to be controlled are thus: a) the furnace temperatures in several zones of the furnace and b) the temperature of slabs at the output (the target temperature).
  • For transferring data from the sensors to the cloud, a computer connected to the WBF process is utilized that is able to manage and update the site metadata, i.e. a Mefos-Service method which runs beforehand to synchronize the factory list, zone list, sensor list, batch list and model list.
  • The input messages are processed at the Kafka server by using a specific topic that is known to both sides, the MefosService and the MefosSpark, and requires a suitable configuration, e.g. "ToSpark".

B. The Cloud side

  • On the cloud side (AWS) there will be the Kafka server, which will receive the streaming data and manage the queue.
  • For the transferring of the results from the cloud back to the process, the Kafka server keeps the control recommendations data and streams them on a specific output topic to a consumer, while the "MefosService" includes the Kafka-consumer feature that pulls the recommendations data from the output topic, e.g. "FromSpark".
  • At the Spark-Streaming stage, the initial data are accumulated in memory and afterwards saved at a historical Big Data repository.
  • In this article an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry has been presented.
  • These developments have been carried out in close relationship with the process industry, as demonstrated by a use case at the walking beam furnace of MEFOS in Sweden.


Cloud Computing for Big Data Analytics in the Process Control Industry

E. Goldin¹, D. Feldman¹, G. Georgoulas², M. Castaño², and G. Nikolakopoulos²

¹GSTAT, Israel
²Robotic Team, Division of Signal and Systems, Electrical Engineering Department, Luleå University of Technology, Luleå, Sweden

The work has received funding from the European Union's Horizon 2020 Research and Innovation Programme under the Grant Agreement No. 636834, DISIRE. Accepted version of the paper with the same title published in the 25th Mediterranean Conference on Control and Automation. Link to the published paper: http://ieeexplore.ieee.org/document/7984310
Abstract— The aim of this article is to present an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry. Latest innovations in the field of Process Analyzer Techniques (PAT), big data and wireless technologies have created a new environment in which almost all stages of the industrial process can be recorded and utilized, not only for safety, but also for real time optimization. Based on analysis of historical sensor data, machine learning based optimization models can be developed and deployed in real time closed control loops. However, the local implementation of those systems still requires a huge investment in hardware and software, as a direct result of the big data nature of sensor data being recorded continuously. The current technological advancements in cloud computing for big data processing open new opportunities for the industry, while acting as an enabler for a significant reduction in costs, making the technology available to plants of all sizes. The main contribution of this article stems from the presentation, for the first time ever, of a pilot cloud based architecture for the application of a data driven modeling and optimal control configuration for the field of Process Control. As it will be presented, these developments have been carried out in close relationship with the process industry and pave the way for a generalized application of the cloud based approaches, towards the future of Industry 4.0.
I. INTRODUCTION
For many years SCADA systems have been used to collect sensor data in order to control industrial processes, usually in real time [1]. The topological complexity of these systems (see [2]) involves large costs associated with scaling and adapting to the vast amount of signals gathered, for allowing a general reconfiguration of the control structure for the process plant (see [3]). It should be also mentioned that the majority of these SCADA systems, up to now, have been utilized mainly for providing an overview of the controlled process, while having the ability to perform Process Analyzer Techniques (PAT) mainly for the statistical processing of the received data for an off line analysis.
However, the recent innovations in online PAT and wireless embedded technologies have created a new era in which almost all stages in the industrial process can be recorded, stored and analyzed. This process is producing a massive amount of sampled data that need to be stored and processed in real time for allowing an overall reconfiguration of the control plant and for achieving a continuous operational optimality against the variations of the production stages.

Towards this vision, the industrial processes require an IT infrastructure that could efficiently manage massive amounts of complex data structures collected from disparate data sources, while providing the necessary computational power and tools for analyzing these data in batch, near and hard real-time approaches. The overall problem becomes more complex because of the diversity of the acquired data, mainly due to: different data and sensor types, data reliability levels, measurement frequencies and missing data. Moreover, in every case, the acquired data need to be filtered, stored and often aggregated before any meaningful analysis can be performed.
With the explosion of the Internet of Things [4] in the last decade, a world of new technologies has become readily accessible and relevant for the industrial process. Nowadays, with relatively low costs, it is possible to send torrents of data to the "cloud" for storage and analysis. Cloud computing encompasses cloud storage, and batch and streaming analysis of data using the latest Machine Learning (ML) algorithms. The potential benefits of using cloud computing for dynamic optimal control in industrial plants include:

• Dramatically reduced costs of storing and analyzing large amounts of data
• Low levels of complexity relative to existing systems
• Enabling the use of advanced ML algorithms in batch and real time
• Reduced industry entry-level costs for implementing advanced control systems
• Enabling large-scale implementation with many low-cost sensors
• Very easy management from the cloud
• Easy scaling or modification of storage capacities
Inspired by these capabilities of the cloud infrastructure and the reachability of these technologies nowadays, the proposed architecture aims to combine the existing PAT based analysis of the process, which is most of the time carried out off line or on a batch of time samples, with the multiple streams of sensory data describing the process and product states. The low-dimensional data should be robust against infrequent updates of PAT measurements and missing data, while handling largely varying measurement intervals. The model should also be able to handle the multivariate and auto-correlated nature of process data and the high quantities of data from regular on line measurements. Principles from wireless sensor networks, estimation and statistical signal processing will be integrated and evaluated with real process data in order to create a novel and reliable PAT based swarm sensing and data analysis that would drive the changes in the Integrated Process Control (IPC) industry. Based on such an architecture it will be for the first time feasible to acquire and process online huge streams of data, improve the process models and correspondingly perform an online reconfiguration or re-tuning of the control scheme, in order to meet the changing demands of the process under investigation and apply plant-wide control techniques (see [5], [6]). Towards this vision, the corresponding architecture of the cloud computing for the big data analytics will be presented, which forms the major contribution of this article. Furthermore, the proposed technological platform will be adjusted to the use case of a walking beam furnace.
The rest of this article is structured as follows. In Section II the architecture and the components of cloud computing will be introduced, while in Section III a use case of a dynamic optimal design problem that can be implemented using the described architecture will be analyzed. Finally, Section IV will conclude the article by summarizing the benefits and limitations of using the described architecture in the industrial process.
II. ARCHITECTURE FOR CLOUD COMPUTING
In batch computing, data is first stored in a Big Data Repository where it can be properly cleaned, aggregated or transformed before being analyzed by the process managers (see [7]). Often this includes saving the data in the Parquet format, which can reduce the size of the data by up to 90%.
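As a purely illustrative sketch of this compression step (assuming pandas with the pyarrow engine; the file and column names are hypothetical, not from the paper), the same batch of sensor readings can be written to both a text format and Parquet so that the resulting file sizes can be compared:

```python
# Hedged sketch: writing a batch of sensor readings to CSV and to Parquet.
# Column names are illustrative only.
import pandas as pd

readings = pd.DataFrame({
    "sensor_id": [1, 2, 3] * 1000,
    "timestamp": pd.date_range("2017-01-01", periods=3000, freq="10s"),
    "value": [850.0, 900.5, 875.2] * 1000,
})

# Parquet is columnar and compressed, which is where the large size
# reduction relative to plain text comes from.
readings.to_parquet("sensor_readings.parquet", engine="pyarrow")
readings.to_csv("sensor_readings.csv", index=False)
```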
In the proposed prototype architecture for batch processing over the Cloud, users (industrial processes) were given access to an Amazon web portal for S3 storage services. All users were encouraged to contribute their raw batch data to the S3 repository. From the S3 storage service it is feasible to collect the data onto virtual computers ("instances") implemented over the EC2 Amazon elastic computing framework, for data analysis and cleaning. On these virtual computers the Hadoop cluster [8] has been installed with a Spark engine [9] for computing and an RStudio Server [10] as an analytic access point for the end-users. Further access is also provided to the virtual computers via the RStudio Server IDE, through which the users can perform ML algorithms and a vast array of statistical analyses on the data. The overall architecture of the proposed cloud system is presented in Figure 1.
In the architecture depicted in Figure 1, historical data collected from sensors embedded in the industrial process are uploaded to the S3 storage on the Amazon Web Services (AWS). After the upload, the data are cleaned and prepared for analysis on the big data framework. The process managers can access these data via local computers, from where they can send, develop and test their algorithms, including dynamic optimal control algorithms running on the cloud for the monitored process.
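For illustration, uploading such a historical batch file to the S3 repository could look like the following sketch (assuming the boto3 library with configured AWS credentials; the bucket and key names are hypothetical):

```python
# Hedged sketch: pushing a local batch file to the S3 historical repository.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="sensor_readings.parquet",  # local batch file
    Bucket="plant-historical-data",      # hypothetical bucket name
    Key="wbf/2017/07/sensor_readings.parquet",
)
```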
Fig. 1. Schematic Diagram of the Cloud Based Architecture

Historical Data Repository - Users were given access to an Amazon S3 storage facility to which they were able to upload their historical/batch data in various formats (csv, JSON, etc.). Amazon Simple Storage Service (S3) is a web storage interface that can facilitate storage of virtually unlimited data, bucketed into objects of up to 5 terabytes in size. Furthermore, the analytic architecture on the cloud is comprised of a "big data" infrastructure, where the files are distributed over several machines for storage and parallel computing, and statistical software with which the data can be transformed and analyzed.
A. Cloud Storage
Amazon Web Services (AWS) offers a suite of over 70
services that form an on-demand computing platform. The
two core services offered are:
1) Amazon Elastic Compute Cloud (EC2) - a virtual computer rental service through which users can run any software they desire and tailor the computer specifications to their specific needs. The payment scheme is per hour of actual usage, where computers can be "stopped" and "started" on demand.
2) Amazon Simple Storage Service (S3) - a web storage interface which can facilitate storage of virtually unlimited data, bucketed into objects of up to 5 terabytes in size.
In the presented architecture, the utilized Amazon on-demand platform allowed for higher flexibility in pricing and almost instantaneous setup of our prototype architecture. It also served as a platform where the different partners could easily upload and access their data for further analysis.
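The on-demand pricing model described above can also be exercised programmatically; as a hedged sketch (boto3 again, with a hypothetical instance ID and region):

```python
# Hedged sketch: stopping and starting an EC2 analysis instance on demand,
# which is what makes the pay-per-hour-of-actual-usage scheme practical.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")
ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"])   # pause billing
ec2.start_instances(InstanceIds=["i-0123456789abcdef0"])  # resume for analysis
```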
B. Hadoop Cluster (HDFS)
Apache Hadoop is the leading open-source software
framework for distributed storage and processing of Big Data
[8]. While Hadoop encompasses a suite of Apache software
programs that help manage the tasks on the distributed
system, the two core components of Hadoop are:
1) Hadoop Distributed File System (HDFS) - The system
that takes very large data, breaks it down into separate
pieces and distributes them to different nodes (servers)
in a cluster.
2) MapReduce - The computational engine that can perform analysis on the cluster.

HDFS was designed to store Big Data with a very high reliability and flexibility to scale up by simply adding commodity servers.
In the presented prototype architecture, Hadoop has been utilized as the framework for setting up the HDFS cluster on which the sensor data are stored.
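As an illustration of interacting with this HDFS cluster from an analysis instance, consider the following sketch (assuming the `hdfs` Python package and WebHDFS being enabled on the name node; host, user and paths are hypothetical):

```python
# Hedged sketch: loading a batch file into HDFS and listing the stored data.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="analyst")
client.upload("/data/wbf/sensor_readings.parquet", "sensor_readings.parquet")

# Inspect what is stored under the sensor-data directory on the cluster.
print(client.list("/data/wbf"))
```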
C. Apache Spark Engine
The main feature of Apache Spark is its in-memory cluster computing, which makes processing much faster than Hadoop's MapReduce technology. Spark uses HDFS for storage purposes, while calculations are performed in memory on each of the nodes. Aside from the increased speed in computation, the Spark engine is able to provide:

• Built-in APIs for multiple languages: Java, Scala, Python and R
• Spark-SQL for querying big data with SQL-like code
• Spark-MLlib [11] for big data parallel machine learning algorithms like linear and logistic regression, K-means clustering, decision trees, random forests, neural networks, recommendation engines and more
• Spark-Streaming for calculating machine learning algorithms on streaming data
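As a hedged illustration of the MLlib capability (assuming a PySpark installation pointed at the cluster; the HDFS path and column names are hypothetical, not from the paper):

```python
# Hedged sketch: fitting an MLlib linear regression to historical sensor data
# stored as Parquet on HDFS.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("wbf-batch-model").getOrCreate()
df = spark.read.parquet("hdfs:///data/wbf/sensor_readings.parquet")

# Assemble the input sensors into a feature vector; the fit itself runs in
# parallel on the cluster.
assembler = VectorAssembler(
    inputCols=["zone1_temp", "zone2_temp", "fuel_rate"], outputCol="features"
)
model = LinearRegression(featuresCol="features", labelCol="slab_exit_temp") \
    .fit(assembler.transform(df))
print(model.coefficients)
```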
D. Process Managers
At the other end of the proposed architecture are the
process managers who, through local computers, can access
and perform machine learning algorithms on the data stored
in the Hadoop cluster. The two leading programs that serve
as an interface for conducting statistical analysis using the
Spark engine are:
1) R - An open-source statistical language used widely both in industry and in academia.
2) Python - An open-source all-around language which has a vast library of functions for implementing machine learning algorithms.
As mentioned above, both of these coding languages have APIs that pass commands to the Spark engine. The process managers access and run these programs through a number of web-based development environments and notebooks, such as the Jupyter notebook, which is popular in the Python community, and RStudio, which is the leading IDE amongst R users.
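For illustration, the kind of ad-hoc query a process manager might issue from such a notebook could look as follows (a sketch with hypothetical paths and columns; the heavy lifting is executed on the cluster by Spark):

```python
# Hedged sketch: an ad-hoc Spark-SQL aggregation issued from a notebook.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-query").getOrCreate()
spark.read.parquet("hdfs:///data/wbf/sensor_readings.parquet") \
    .createOrReplaceTempView("readings")

spark.sql("""
    SELECT sensor_id, avg(value) AS mean_value
    FROM readings
    GROUP BY sensor_id
""").show()
```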
E. Control Feedback Loop
After the process managers have performed their analysis,
they can set up dynamic models for implementation in
the cloud that can push back responses to the industrial
processes. This process is explained further in the Near Real-
Time Computing subsection.
F. Historical Big Data Repository
In the cloud, the raw data and the process manager’s
recommendations will be stored at the historical big data
repository (AWS S3). AWS offers great flexibility in storage
plans that have the merit to be easily scaled as needed.
G. Near Real-time Computing
Apache Kafka [12] is a publish-subscribe messaging
application that enables sending and receiving streaming
information between the plants and the Spark engine on the
cloud. On the local computers (in the plants) a Kafka API
(which consists of a few Java libraries) sends streaming data
to a Kafka Server set up on AWS that manages the queue of
information passed on to the Spark engine. The Spark engine
then performs the streaming analysis and pushes back the
results to the Kafka server and from there back to the plants.
The analysis can be either cleaning of the data, searching for outliers, or implementing an ML algorithm in real-time.
In addition, every 10 minutes the Spark server sends the
accumulated data to the Historical Big Data Repository for
future use or for batch computing.
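The paper describes the plant-side Kafka API as a set of Java libraries; purely as an illustration of the same producer pattern, here is a sketch using the kafka-python package (the broker address and message fields are hypothetical):

```python
# Hedged sketch: a plant-side producer streaming sensor readings toward the
# Spark engine via the Kafka server on AWS.
import json
import time
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example-aws-host:9092",
    value_serializer=lambda m: json.dumps(m).encode("utf-8"),
)

while True:
    reading = {"sensor_id": 7, "timestamp": time.time(), "value": 861.4}
    producer.send("ToSpark", reading)  # topic name borrowed from the use case
    time.sleep(10)
```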
H. Batch Computing
In batch computing, the data are initially stored in the
Historical Big Data Repository where it can be properly
cleaned, aggregated or transformed before being analyzed
by the process managers. In many cases, this step includes saving the data in the Parquet format, which can significantly reduce its size, using the R or Python languages. In
general, the process managers can choose from a vast array
of ML algorithms that can be implemented on the cluster
through the Spark engine.
III. THE USE CASE OF THE WALKING BEAM FURNACE
The walking beam furnace is used to re-heat slabs (large steel beams) to a specific temperature before their refinement in the steel industry (see [13]). The slabs are walked from the feed to the output of the furnace by the cyclic movement of the so-called walking beams. During this passage, the items are directly exposed to the heat produced by burners located inside the furnace. Since the heat distribution affects the quality of the finished product, a natural optimal control problem in this context is to regulate pre-assigned temperatures at specific points of the furnace, while minimizing the energy expenditure for the heat generation (see [14], [15]).
The walking beam furnace at MEFOS is an experimental furnace and lacks some of the features of an industrial furnace. Specifically, the temperatures throughout the furnace are not feedback controlled (as is otherwise customary in the industry), i.e., the furnace operates in open loop. Currently, a human operator configures the furnace set-points manually (the set-point values are, however, computed numerically) and then measures the slab temperature at the furnace exit using a pyrometer. In fact, under normal operating conditions, the open-loop control can be tuned to work well. Additionally, this industrial installation is affected by stops and other variations that influence the control performance and correspondingly drive the need for a feedback control loop.
In the described use case the main variables that need to be controlled are thus: a) the furnace temperatures in several zones of the furnace and b) the temperature of the slabs at the output (the target temperature). Furthermore, the main objective is to reduce the operating costs through the reduction of energy consumption. In this respect, a small decrease in energy consumption, such as 0.5%, translates into a saving of 2 kWh per ton of heated product, while optimal control strategies could lead to quality improvements as well. The overall schematic diagram of the WBF with the indicative control loops, the sensors and the different heating zones is depicted in Figure 2.
Fig. 2. Schematic Diagram of the Walking Beam Furnace
To achieve these goals there is a need to gather more information about the process on-line, while the optimal control's output would optimize the process by controlling the following variables: 1) the fuel supply rate at the burners, one burner at each zone, for a total of three burners, 2) the fuel atomization air supply rate, one for each burner, 3) the combustion air flow, one at each zone, for a total of three zones, and 4) the exhaust flow, e.g. the exhaust damper position, with one exhaust damper in the furnace.
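Purely for illustration, this control vector could be represented as follows (the type and field names are ours, not from the paper):

```python
# Hedged sketch: a hypothetical container for the set-points the optimizer
# would push back to the WBF gateway.
from dataclasses import dataclass

@dataclass
class WBFControls:
    fuel_supply_rate: list[float]       # one burner per zone, three zones
    atomization_air_rate: list[float]   # one per burner
    combustion_air_flow: list[float]    # one per zone
    exhaust_damper_position: float      # single exhaust damper in the furnace
```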
In this use case, MEFOS has installed a dedicated PC at the WBF site for managing the flow of the measurement data. Figure 3 presents the flow of the sensory data from the ABB control system to the connectivity server, from there to the corresponding PC, and in the sequel to the cloud.
Fig. 3. Cloud Based Implemented Architecture of the WBF
In the presented use case it is intended to stream the data on-line, in near real-time, from the process to the Kafka service in the cloud by using the Kafka-producer component, with Apache Kafka providing the publish-subscribe messaging. In the cloud the data will be pulled by the Kafka-consumer that will be implemented at the Spark cluster. At the cluster, the data will be verified, cleaned, aggregated, organized and sent to the optimal control system to determine recommendations. Afterwards the optimizer's recommendations will be pushed back to Kafka, while the corresponding gateway will determine the fuel supply rate at the burners, the fuel atomization air supply rate, the combustion air flow and the exhaust flow. In the cloud the raw data and the optimizer's recommendations will be stored at the historical big data repository (AWS S3). The overall schematic representation of the presented architecture is depicted in Figure 4.

Fig. 4. Schematic description of the Architecture
For this use case, the variables required by the optimal control module are the ones listed in Figure 5. The minimum data input for the optimal control is 200 past values of the 10-second averages of the above parameters, i.e. one value every 10 seconds over the last 2,000 seconds (33 minutes and 20 seconds).

Fig. 5. Variables required by the optimal control module
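As a hedged illustration of assembling this input window (pandas again; the file and column names carry over from the earlier hypothetical sketches):

```python
# Hedged sketch: building the optimizer's minimum input window of 200
# ten-second averages (i.e. the last 2,000 seconds).
import pandas as pd

raw = pd.read_parquet("sensor_readings.parquet").set_index("timestamp")

ten_sec_means = raw["value"].resample("10s").mean()
window = ten_sec_means.tail(200)  # last 2,000 seconds of averages
assert len(window) == 200
```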
A. Transferring data from the sensors to the cloud
For transferring data from the sensors to the cloud, a computer connected to the WBF process is utilized that is able to manage and update the site metadata, i.e. a Mefos-Service method which runs beforehand to synchronize the factory list, zone list, sensor list, batch list and model list. Furthermore, this method creates a file in JSON structure with 3 fields, FactoryID, ZoneID and SensorID, covering every possible value, while the posted data can be either a single message or an array. The input messages are processed at the Kafka server by using a specific topic that is known to both sides, the MefosService and the MefosSpark, and requires a suitable configuration, e.g. "ToSpark". The Kafka API provides a callback method which verifies the input streaming received on the Kafka server. The POST method "/SendMeashurements" uses this API to evaluate any loss, if there is any. The two supported message types are listed in Table I.

TABLE I - MESSAGE TYPES

Message Type 1 - Process Status Change:
- Factory ID: F key [Predefined Integer]
- Batch ID: F key [Predefined Integer]
- Status ID: P key [Running Integer]
- Date time: [Time Stamp]
- Current Status: [Predefined String: Idle/Start/Stop/Pause/Restart]

Message Type 2 - Measurements:
- Factory ID: F key [Predefined Integer: -1 / 1 / 2 / 3 / ...]
- Zone ID: F key [Predefined Integer]
- Sensor ID: F key [Predefined Integer]
- Batch ID: F key [Predefined Integer]
- Date Time: [Time Stamp]
- Measurement value: [Double]
- Measurement unit: [Char: C / % / m³/h / kg/h / MMWC / Boolean]
- Quality: [Integer]
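For illustration, a Type 2 measurement message following the fields of Table I might be constructed as in the sketch below (the concrete values are invented):

```python
# Hedged sketch: a Type 2 (Measurements) message as the MefosService might
# post it, ready for the "ToSpark" topic.
import json

measurement = {
    "FactoryID": 1,
    "ZoneID": 2,
    "SensorID": 14,
    "BatchID": 371,
    "DateTime": "2017-05-04T10:32:10Z",
    "MeasurementValue": 1154.6,
    "MeasurementUnit": "C",
    "Quality": 1,
}

payload = json.dumps(measurement).encode("utf-8")
```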
B. The Cloud side
On the cloud side (AWS) there will be the Kafka server, which will receive the streaming data and manage the queue. Overall, the data will be routed through the Kafka server into the Spark cluster and from there back to Kafka. As mentioned before, the Kafka server will be held responsible for managing the messages that arrive from the MefosService. The Spark streaming process consumes the measurement data from the Kafka server, stores them in memory, and feeds the relevant process models every 10 seconds. In every batch interval the process receives the recommendations per measurement type from each model and sends the recommendations to the Kafka server. In the sequel, the Spark streaming process saves the measurement data along with the recommendations to AWS S3. Overall, the streaming process is depicted in Figure 6.

Fig. 6. Overview of the Streaming Process
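The paper's prototype uses Spark Streaming; as a hedged, modernized sketch of the same consume-and-archive pattern (Structured Streaming, assuming the spark-sql-kafka package on the classpath; the broker, topic and S3 locations are hypothetical):

```python
# Hedged sketch: a cloud-side streaming job that consumes measurements from
# Kafka and archives them to the historical repository on S3 as Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wbf-streaming").getOrCreate()

measurements = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example-aws-host:9092")
    .option("subscribe", "ToSpark")
    .load()
)

query = (
    measurements.selectExpr("CAST(value AS STRING) AS json")
    .writeStream.format("parquet")
    .option("path", "s3a://plant-historical-data/wbf/stream/")
    .option("checkpointLocation", "s3a://plant-historical-data/wbf/checkpoints/")
    .start()
)
query.awaitTermination()
```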
The Kafka server will also keep and be responsible for the queue of recommendations data that arrives from the Spark cluster. For the transferring of the results from the cloud back to the process, the Kafka server keeps the control recommendations data and streams them on a specific output topic to a consumer, while the "MefosService" includes the Kafka-consumer feature that pulls the recommendations data from the output topic, e.g. "FromSpark". Finally, the output recommendations reach the Web-API of the process via a provided URL.
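On the receiving end, the plant-side Kafka-consumer feature described above could be illustrated as follows (kafka-python again; the broker address and the handling of the recommendations are hypothetical placeholders):

```python
# Hedged sketch: the plant-side consumer pulling the optimizer's
# recommendations from the "FromSpark" topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "FromSpark",
    bootstrap_servers="kafka.example-aws-host:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    recommendation = message.value
    # Placeholder for forwarding the set-points to the process Web-API.
    print("apply set-points:", recommendation)
```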
For the big data repository, the Spark-Streaming process metadata are synchronized and pre-processed. After this step the data are pushed from the Mefos-Service PC into the Kafka server and from there are pulled by the Spark cluster. At the Spark-Streaming stage, the initial data are accumulated in memory and afterwards saved at the historical Big Data repository. The control recommendations data are also accumulated in memory and saved at the historical Big Data repository that resides at AWS S3 (Amazon Simple Storage Service), while the files will be saved as the Parquet file type with the following benefits: 1) the structure of the table, i.e. the number of the columns, their types and the delimiter between columns, will be saved, 2) the data are compressed, a fact that saves about 60% of their volume compared to a text file type, and 3) it enables the straight upload into Spark in-memory data storage, so no conversions will be needed. Furthermore, the historical Big Data repository will enable deep investigation of the data in case it is required for the development of new models, such as the BI reports, etc.

IV. CONCLUSIONS

In this article an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry has been presented. These developments have been carried out in close relationship with the process industry, as demonstrated by a use case at the walking beam furnace of MEFOS in Sweden. Part of the future work includes the full extended experimentation and validation of the proposed scheme in WBF campaigns.

Citations
Proceedings ArticleDOI
02 Jul 2018
TL;DR: A platform concept is depicted, which combines cloud computing and industrial control using edge devices realized for an automation cell, which opens up new potentials in the industrial sector.
Abstract: In the past, industrial control of field devices was comprised of self-contained systems in a dedicated network for exchanging control information between field devices and control hardware to accomplish process tasks. Nowadays, cloud computing enables a massive amount of computing resources and high availability, which opens up new potentials in the industrial sector. Until now, the integration of cloud solutions in industrial control was limited due to missing technologies connecting the Internet of Things with industrial requirements. Furthermore, based on existing paradigms there is a lack of appropriate architecture concepts for industrial control. This paper depicts a platform concept, which combines cloud computing and industrial control using edge devices realized for an automation cell.

25 citations



Journal ArticleDOI
TL;DR: Wang et al. as mentioned in this paper analyzed the uncontrollable cyber threats and classified attack characteristics, and elaborated the intrinsic vulnerabilities in current networked control systems and novel security challenges in future Industrial Internet.
Abstract: Due to the deep integration of information technology and operational technology, networked control systems are experiencing an increasing risk of international cyber attacks. In practice, industrial cyber security is a significant topic because current networked control systems are supporting various critical infrastructures to offer vital utility services. By comparing with traditional IT systems, this paper first analyzes the uncontrollable cyber threats and classified attack characteristics, and elaborates the intrinsic vulnerabilities in current networked control systems and novel security challenges in future Industrial Internet. After that, in order to overcome partial vulnerabilities, this paper presents a few representative security mechanisms which have been successfully applied in today's industrial control systems, and these mechanisms originally improve traditional IT defense technologies from the perspective of industrial availability. Finally, several popular security viewpoints, adequately covering the needs of industrial network structures and service characteristics, are proposed to combine with burgeoning industrial information technologies. We target to provide some helpful security guidelines for both academia and industry, and hope that our insights can further promote in-depth development of industrial cyber security.

7 citations

Journal ArticleDOI
12 Feb 2020
TL;DR: A novel cloud-client integrative industrial Internet architecture and solutions for related key technologies are proposed and demonstrated for some specific applications in the field of intelligent manufacturing.
Abstract: The fourth industrial revolution has been unveiled with the rapid development of Internet of things (IoT), cloud computing, and big data. The industrial Internet, as a highly cooperative and intelligence-sharing global network that connects entities, human beings, and the environment in smart manufacturing, is the core of this revolution. However, most current research on the industrial Internet is restricted to IoT, cloud computing, or big data, respectively. The synergy between the cloud and clients is currently at a very primary stage of sensing, connection, and knowledge, lacking a cloud-client-integrative architecture and key technologies that could meet the evolving requirements of networked smart production, including more complex objects to be sensed, more diversification of entities to be connected, faster data processing, and more intelligent feedback control. This paper first surveys some important research directions with respect to this research field and summarizes the research status and challenges. On this basis, a novel cloud-client integrative industrial Internet architecture and solutions for related key technologies are proposed. Then, the proposed technologies are demonstrated for some specific applications in the field of intelligent manufacturing. Finally, the prospects for cloud-client-integrative industrial Internet research are discussed and concluded.

6 citations

Proceedings ArticleDOI
01 Nov 2017
TL;DR: A cloud-extended sensor network with supervisory control in a public cloud with simple system architecture and cost savings with the use of low-cost sensors and cloud resources is presented.
Abstract: The current automation supervisory control systems are situated in well-restricted areas and require investments in computing hardware and communication systems. In machine automation systems, any additional computing hardware can be cumbersome to install, making upgrades hard to apply. This paper presents a cloud-extended sensor network with supervisory control in a public cloud. The hardware and cloud resources used in the solution are low-cost, reducing the up-front costs compared to the use of high-end components. The system collects data from ST microprocessor (STM)-based sensor nodes that send inertial measurement data using user datagram protocol (UDP). The sensor itself is a Bosch BMI160, a cheap and small inertial measurement unit (IMU). The system is designed to be used in machine automation applications where the frequency of the sensory data produced is hundreds of hertz. The system is to provide low-latency data transfer to the cloud. In the cloud environment, data is collected by a computing service that can be programmed to perform algorithms on it. The system is tested in a setup consisting of five IMU sensors and an angle measurement unit attached to a hydraulically actuated flexible beam. The test setup aims to update a local control system's parameters based on a cloud algorithm and camera measurements of the beam tip position. The control results and communication latency are inspected. The main advantages of the proposed solution are the simple system architecture and cost savings with the use of low-cost sensors and cloud resources. The focus of this study is the functionality of such a system; intricate security issues are beyond the scope of this study.

6 citations

Proceedings ArticleDOI
20 Nov 2019
TL;DR: This work designs and implements Agni, an efficient, distributed, dual-access object storage file system (OSFS), that uses standard object storage APIs and cloud microservices, and overcomes the performance shortcomings of existing approaches by implementing a multi-tier write aggregating data structure and by integrating with existing cloud-native services.
Abstract: Object storage is a low-cost, scalable component of cloud ecosystems. However, interface incompatibilities and performance limitations inhibit its adoption for emerging cloud-based workloads. Users are compelled to either run their applications over expensive block storage-based file systems or use inefficient file connectors over object stores. Dual access, the ability to read and write the same data through file systems interfaces and object storage APIs, has promise to improve performance and eliminate storage sprawl. We design and implement Agni, an efficient, distributed, dual-access object storage file system (OSFS), that uses standard object storage APIs and cloud microservices. Our system overcomes the performance shortcomings of existing approaches by implementing a multi-tier write aggregating data structure and by integrating with existing cloud-native services. Moreover, Agni provides distributed access and a coherent namespace. Our experiments demonstrate that for representative workloads Agni improves performance by 20%--60% when compared with existing approaches.

5 citations



References
Journal ArticleDOI
TL;DR: This survey is directed to those who want to approach this complex discipline and contribute to its development, and finds that still major issues shall be faced by the research community.

12,539 citations


"Cloud computing for big data analyt..." refers background in this paper

  • ...With the explosion of the “Internet of Things” [4] in the last decade, a world of new technologies has become readily accessible and relevant for the industrial process....

    [...]

Proceedings Article
22 Jun 2010
TL;DR: Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
Abstract: MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, most of these systems are built around an acyclic data flow model that is not suitable for other popular applications. This paper focuses on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis tools. We propose a new framework called Spark that supports these applications while retaining the scalability and fault tolerance of MapReduce. To achieve these goals, Spark introduces an abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.

4,959 citations


"Cloud computing for big data analyt..." refers methods in this paper

  • ...[9] for computing and an RStudio Server [10] as an analytic access point for the end-users....

    [...]

Proceedings ArticleDOI
17 Aug 2012
TL;DR: This paper argues that the above characteristics make the Fog the appropriate platform for a number of critical Internet of Things services and applications, namely, Connected Vehicle, Smart Grid, Smart Cities, and, in general, Wireless Sensors and Actuators Networks (WSANs).
Abstract: Fog Computing extends the Cloud Computing paradigm to the edge of the network, thus enabling a new breed of applications and services. Defining characteristics of the Fog are: a) Low latency and location awareness; b) Wide-spread geographical distribution; c) Mobility; d) Very large number of nodes, e) Predominant role of wireless access, f) Strong presence of streaming and real time applications, g) Heterogeneity. In this paper we argue that the above characteristics make the Fog the appropriate platform for a number of critical Internet of Things (IoT) services and applications, namely, Connected Vehicle, Smart Grid, Smart Cities, and, in general, Wireless Sensors and Actuators Networks (WSANs).

4,440 citations

Book
29 May 2009
TL;DR: This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoops clusters.
Abstract: Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters. Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you: Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence Discover common pitfalls and advanced features for writing real-world MapReduce programs Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud Use Pig, a high-level query language for large-scale data processing Take advantage of HBase, Hadoop's database for structured and semi-structured data Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject. "Now you have the opportunity to learn about Hadoop from a master-not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!

3,797 citations


"Cloud computing for big data analyt..." refers background or methods in this paper

  • ...framework for distributed storage and processing of Big Data [8]....

    [...]

  • ...The Controls recommendations data are also accumulated at the memory and are saved at the historical Big Data repository that relies at the AWS S3 (Amazon Simple Storage Service), while the files will be saved as Parquet file type with the following benefits: 1) The structure of the table, i.e. the number of the columns, their types and the delimiter between columns, will be saved, 2) the data are compressed, a fact that saves about 60% of its volume compared to text file type, and 3) it enables the straight upload into Spark in memory data storage, no conversions will be needed....

    [...]

  • ...Furthermore, the historical Big Data repository will enable deep investigation of the data in case it is required for the development of new models, such as the BI reports, etc....

    [...]

  • ...Apache Hadoop is the leading open-source software framework for distributed storage and processing of Big Data [8]....

    [...]

  • ...In batch computing, data is first stored in a Big Data Repository where it can be properly cleaned, aggregated or transformed before being analyzed by the process managers (see [7])....

    [...]

Book ChapterDOI
01 Jan 2019
TL;DR: This chapter argues that the above characteristics make the Fog the appropriate platform for a number of critical internet of things services and applications, namely connected vehicle, smart grid, smart cities, and in general, wireless sensors and actuators networks (WSANs).
Abstract: Fog computing extends the cloud computing paradigm to the edge of the network, thus enabling a new breed of applications and services. Defining characteristics of the Fog are 1) low latency and location awareness, 2) widespread geographical distribution, 3) mobility, 4) very large number of nodes, 5) predominant role of wireless access, 6) strong presence of streaming and real time applications, and 7) heterogeneity. In this chapter, the authors argue that the above characteristics make the Fog the appropriate platform for a number of critical internet of things (IoT) services and applications, namely connected vehicle, smart grid, smart cities, and in general, wireless sensors and actuators networks (WSANs).

2,384 citations


