Cloud computing for big data analytics in the Process Control Industry
Summary (3 min read)
I. INTRODUCTION
- For many years, SCADA systems have been used to collect sensor data in order to control industrial processes, usually in real time [1].
- The overall problem becomes more complex because of the diversity of the acquired data, mainly due to different data and sensor types, varying data reliability levels, different measurement frequencies, and missing data.
- With the explosion of the "Internet of Things" [4] in the last decade, a world of new technologies has become readily accessible and relevant for the industrial process.
- Cloud computing encompasses cloud storage as well as batch and streaming analysis of data using the latest Machine Learning (ML) algorithms.
- Based on such an architecture, it will be feasible for the first time to acquire and process huge streams of data online, improve the process models, and correspondingly perform an online reconfiguration or re-tuning of the control scheme, in order to meet the changing demands of the process under investigation and apply plant-wide control techniques (see [5], [6]).
II. ARCHITECTURE FOR CLOUD COMPUTING
- In batch computing, data is first stored in a Big Data Repository, where it can be properly cleaned, aggregated, or transformed before being analyzed by the process managers (see [7]).
- Often this includes saving the data in Parquet format, which can reduce the data size by up to 90% of its original size (a sketch of this batch path follows this list).
- All users were encouraged to contribute their raw batch data to the S3 repository.
- From the S3 storage service it is feasible to collect the data onto virtual computers ("instances") provided by the Amazon EC2 elastic computing framework, for data analysis and cleaning.
- On these virtual computers, a Hadoop cluster [8] has been installed, with a Spark engine [9] for computing and an RStudio Server [10] as an analytic access point for the end-users.
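The batch path just described can be illustrated with a short PySpark sketch: read raw sensor files from the S3 repository, clean and aggregate them, and persist them as Parquet. The bucket names, paths, and column names below are illustrative assumptions, not from the paper.

```python
# Minimal sketch of the batch ingest path: S3 raw data -> Spark -> Parquet.
# Bucket, paths, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("sensor-batch-ingest")
         .getOrCreate())

# Read the raw batch data that users contributed to the S3 repository.
raw = (spark.read
       .option("header", "true")
       .option("inferSchema", "true")
       .csv("s3a://plant-raw-data/sensors/*.csv"))  # hypothetical bucket

# Basic cleaning: drop rows with missing measurements and aggregate
# per sensor, mirroring the clean/aggregate/transform step above.
clean = (raw.dropna(subset=["value"])
            .groupBy("sensor_id")
            .agg(F.avg("value").alias("avg_value")))

# Parquet is columnar and compressed, which is where the size
# reduction mentioned above comes from.
clean.write.mode("overwrite").parquet("s3a://plant-curated/sensors.parquet")
```

Reading `s3a://` paths assumes the cluster is configured with the Hadoop AWS connector and valid credentials.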
B. Hadoop Cluster (HDFS)
- Apache Hadoop is the leading open-source software framework for distributed storage and processing of Big Data [8].
- While Hadoop encompasses a suite of Apache software programs that help manage tasks on the distributed system, the two core components of Hadoop are: 1) the Hadoop Distributed File System (HDFS), the system that takes very large data, breaks it down into separate pieces, and distributes them to different nodes in a cluster; and 2) MapReduce, the computational engine that can perform analysis on the cluster.
- HDFS was designed to store Big Data with very high reliability and the flexibility to scale up by simply adding commodity servers.
- In the presented prototype architecture, Hadoop has been utilized as the framework for setting up the HDFS cluster on which the sensor data are stored (see the sketch below).
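As a minimal sketch of storing sensor data on the cluster, the `hdfs` Python client can upload a file through the WebHDFS interface. The namenode host, port, user, and paths are assumptions for illustration.

```python
# Sketch: upload a batch of raw sensor readings to HDFS via WebHDFS.
# Host, port, user, and paths are hypothetical.
from hdfs import InsecureClient  # pip install hdfs

# Connect to the namenode's WebHDFS endpoint.
client = InsecureClient("http://namenode:9870", user="hadoop")

# HDFS splits the uploaded file into blocks and replicates them
# across the commodity servers, which gives the reliability and
# scalability noted above.
client.upload("/data/raw/sensors.csv", "sensors.csv")

# Inspect what is stored on the cluster.
print(client.list("/data/raw"))
```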
C. Apache Spark Engine
- The main feature of Apache Spark is its in-memory cluster computing, which makes processing much faster than Hadoop's MapReduce technology.
- Spark uses HDFS for storage, while calculations are performed in memory on each of the nodes (illustrated below).
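A small sketch of the in-memory idea: once a dataset is cached, repeated queries avoid re-reading from HDFS. The path and column names are illustrative assumptions.

```python
# Sketch: Spark's in-memory computing via caching. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

df = spark.read.parquet("hdfs://namenode:8020/data/sensors")

# cache() keeps the partitions in the memory of the worker nodes,
# so the two aggregations below read from disk at most once.
df.cache()

df.groupBy("sensor_id").count().show()
df.agg({"temperature_c": "avg"}).show()
```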
D. Process Managers
- At the other end of the proposed architecture are the process managers who, through local computers, can access and perform machine learning algorithms on the data stored in the Hadoop cluster.
- The two leading programs that serve as an interface for conducting statistical analysis using the Spark engine are: 1) R, an open-source statistical language used widely both in industry and academia.
- 2) Python, an open-source all-around language which has a vast library of functions for implementing machine learning algorithms.
- As mentioned above, both of these coding languages have APIs that pass commands to the Spark engine.
- The process managers access and run these programs through a number of web-based development environments and notebooks, such as the Jupyter notebook, which is popular in the Python community, and RStudio, which is the leading IDE amongst R users.
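From such a notebook, a process manager's commands are passed through the API to the Spark engine, which executes them on the cluster. A sketch of attaching a session to the cluster, where the master URL and data path are assumptions:

```python
# Sketch: a process manager's notebook session against the cluster.
# Master URL and data path are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")                # run against the Hadoop cluster
         .appName("process-manager-session")
         .getOrCreate())

# Each command typed in the notebook is dispatched to the Spark
# engine and evaluated on the cluster nodes.
sensors = spark.read.parquet("hdfs://namenode:8020/data/sensors")
sensors.describe().show()              # quick statistical summary
```

An R user would do the equivalent from RStudio through Spark's R API.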
E. Control Feedback Loop
- After the process managers have performed their analysis, they can set up dynamic models for implementation in the cloud that can push back responses to the industrial processes.
- This process is explained further in the Near Real-Time Computing subsection.
F. Historical Big Data Repository
- In the cloud, the raw data and the process managers' recommendations will be stored in the historical big data repository (AWS S3).
- AWS offers great flexibility in storage plans, which can easily be scaled as needed (a brief sketch follows).
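A minimal sketch of feeding the repository, assuming AWS credentials are configured locally; the bucket and key names are illustrative assumptions.

```python
# Sketch: store raw data and recommendations in the historical
# S3 repository. Bucket and keys are hypothetical.
import boto3

s3 = boto3.client("s3")

# Raw sensor data.
s3.upload_file("sensors.parquet", "historical-repo",
               "raw/sensors.parquet")
# The process managers' control recommendations.
s3.upload_file("recommendations.json", "historical-repo",
               "recommendations/latest.json")
```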
G. Near Real-time Computing
- Apache Kafka [12] is a publish-subscribe messaging application that enables sending and receiving streaming information between the plants and the Spark engine on the cloud.
- On the local computers (in the plants) a Kafka API (which consists of a few Java libraries) sends streaming data to a Kafka Server set up on AWS that manages the queue of information passed on to the Spark engine.
- The Spark engine then performs the streaming analysis and pushes back the results to the Kafka server and from there back to the plants.
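A minimal sketch of the plant-side producer described above, using the kafka-python client; the broker address is an illustrative assumption, and the topic name anticipates the "ToSpark" topic used in the WBF use case below.

```python
# Sketch: plant-side Kafka producer streaming a measurement to the
# broker on AWS. Broker address is hypothetical.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka.example-aws-host:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

# Publish one streaming measurement; the Kafka server queues it
# for the Spark engine.
producer.send("ToSpark", {"sensor_id": "T-101", "value": 953.2})
producer.flush()
```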
H. Batch Computing
- In batch computing, the data are initially stored in the Historical Big Data Repository, where they can be properly cleaned, aggregated, or transformed before being analyzed by the process managers.
- In many cases, this step includes saving the data in the Parquet format, which preserves the table structure (column names, types, and delimiters), compresses the data (saving roughly 60% of the volume compared to a text file), and can be loaded directly into Spark's in-memory storage without conversion; the repository is then accessed by using the R or Python languages.
- In general, the process managers can choose from a vast array of ML algorithms that can be implemented on the cluster through the Spark engine.
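As one possible example of such a batch ML step (the paper does not prescribe a specific algorithm), a linear model could be fitted on the cluster with Spark MLlib; the column names and path below are assumptions.

```python
# Sketch: one candidate batch ML step on the cluster, a linear
# regression in Spark MLlib. Columns and path are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("batch-ml").getOrCreate()
data = spark.read.parquet("s3a://historical-repo/raw/sensors.parquet")

# MLlib expects the inputs assembled into a single feature vector.
assembler = VectorAssembler(
    inputCols=["zone_temp", "fuel_flow"], outputCol="features")
train = assembler.transform(data)

model = LinearRegression(
    featuresCol="features", labelCol="slab_temp").fit(train)
print(model.coefficients, model.intercept)
```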
III. THE USE CASE OF THE WALKING BEAM FURNACE
- The walking beam furnace is used to re-heat slabs (large steel beams) to a specific temperature before their refinement in the steel industry (see [13]).
- In the described use case, the main variables that need to be controlled are: a) the temperatures in several zones of the furnace and b) the temperature of the slabs at the output (the target temperature).
- For transferring data from the sensors to the cloud, a computer connected to the WBF process is utilized that is able to manage and update the site metadata, i.e., a MefosService method which runs preliminarily to synchronize the factory list, zone list, sensor list, bath list, and model list.
- The input messages are processed at the Kafka server using a specific topic that is known by both sides (the MefosService and the MefosSpark) and requires a suitable configuration, e.g. "ToSpark" (an illustrative message follows).
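Reusing the producer configuration sketched in Section II-G, a message on the "ToSpark" topic might carry the site metadata that the MefosService keeps synchronized; the field names below are assumptions based on the lists mentioned above.

```python
# Sketch: an illustrative measurement message for the "ToSpark" topic.
# Field names are hypothetical, derived from the synchronized lists.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example-aws-host:9092",  # hypothetical
    value_serializer=lambda v: json.dumps(v).encode("utf-8"))

measurement = {
    "factory": "MEFOS",         # from the synchronized factory list
    "zone": "heating-zone-2",   # from the zone list
    "sensor": "T-101",          # from the sensor list
    "value": 953.2,
    "timestamp": "2024-05-01T10:00:00Z",
}
producer.send("ToSpark", measurement)
producer.flush()
```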
B. The Cloud Side
- On the cloud side (AWS), the Kafka server receives the streaming data and manages the queue.
- For transferring the results from the cloud back to the process, the Kafka server keeps the control recommendations data and streams them on a specific output topic, e.g. "FromSpark", to a consumer; the "MefosService" includes a Kafka consumer feature that pulls the recommendations data from this output topic.
- In Spark Streaming, the incoming data are first accumulated in memory and afterwards saved to the historical Big Data repository (a sketch of this job follows).
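A sketch of the cloud-side streaming job, assuming Spark Structured Streaming with the Kafka connector; the broker address, checkpoint path, and the pass-through "model" step are illustrative assumptions.

```python
# Sketch: cloud-side job consuming "ToSpark" and answering on "FromSpark".
# Broker address and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wbf-streaming").getOrCreate()

# Consume the measurement messages queued by the Kafka server.
stream = (spark.readStream.format("kafka")
          .option("kafka.bootstrap.servers", "kafka.example-aws-host:9092")
          .option("subscribe", "ToSpark")
          .load())

# Placeholder for feeding the process models: pass the payload through;
# a real job would compute the control recommendations here and would
# also periodically persist the accumulated data to the S3 repository.
recs = stream.select(F.col("value").alias("value"))

# Push the recommendations back on the output topic for the
# MefosService consumer at the plant.
query = (recs.writeStream.format("kafka")
         .option("kafka.bootstrap.servers", "kafka.example-aws-host:9092")
         .option("topic", "FromSpark")
         .option("checkpointLocation", "s3a://historical-repo/checkpoints")
         .start())
query.awaitTermination()
```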
- In this article, an example of a novel cloud computing infrastructure for big data analytics in the Process Control Industry has been presented.
- These developments have been carried out in close relationship with the process industry, as demonstrated by the presented use case of the walking beam furnace at MEFOS in the Swedish steel industry.
Frequently Asked Questions (18)
Q2. What are the future works mentioned in the paper "Cloud computing for big data analytics in the process control industry" ?
Part of the future work includes the full extended experimentation and validation of the proposed scheme in WBF campaigns.
Q3. What is the main feature of Apache Spark?
The main feature of Apache Spark is its in-memory cluster computing, which makes processing much faster than Hadoop's MapReduce technology.
Q4. What is the way to control the temperature of the furnace?
Since the heat distribution affects the quality of the finished product, a natural optimal control problem in this context is to regulate pre-assigned temperatures at specific points of the furnace, while minimizing the energy expenditure for the heat generation (see [14], [15]).
Q5. What is the main objective of the walking beam furnace?
The walking beam furnace is used to re-heat slabs (large steel beams) to a specific temperature before their refinement in the steel industry (see [13]).
Q6. How does the Spark streaming process handle the data?
The Spark streaming process consumes measurement data from the Kafka server, stores it in memory, and feeds the relevant process models every 10 seconds.
Q7. What is the core component of Hadoop?
While Hadoop encompasses a suite of Apache software programs that help manage tasks on the distributed system, the two core components of Hadoop are: 1) the Hadoop Distributed File System (HDFS) and 2) MapReduce.
Q8. What is the main purpose of the Spark engine?
The Spark engine performs the streaming analysis and pushes back the results to the Kafka server, and from there back to the plants.
Q9. What is the main purpose of Apache Kafka?
Apache Kafka [12] is a publish-subscribe messaging application that enables sending and receiving streaming information between the plants and the Spark engine on the cloud.
Q10. What is the use case for the WBF?
In the presented use case, the intention is to stream the data online, in near real time, from the process to the Kafka service in the cloud by using the Kafka producer component, with Apache Kafka acting as a publish-subscribe messaging application.
Q11. What is the main feature of Hadoop?
In the presented prototype architecture, Hadoop has been utilized as the framework for setting up the HDFS cluster on which the sensor data are stored.
Q12. What is the main contribution of this article?
The current technological advancements in cloud computing for big data processing, open new opportunities for the industry, while acting as an enabler for a significant reduction in costs, making the technology available to plants of all sizes.
Q13. What is the purpose of the proposed prototype architecture for batch processing over the Cloud?
In the proposed prototype architecture for batch processing over the Cloud, users (industrial processes) were given access to an Amazon web portal for S3 storage services.
Q14. What is the purpose of the Spark engine?
In addition, every 10 minutes the Spark server sends the accumulated data to the Historical Big Data Repository for future use or for batch computing.
Q15. What is the schematic diagram of the WBF?
For transferring data from the sensors to the cloud, a computer connected to the WBF process is utilized that is able to manage and update the site metadata, i.e., a MefosService method which runs preliminarily to synchronize the factory list, zone list, sensor list, bath list, and model list.
Q16. What is the main purpose of the Big Data repository?
The historical Big Data repository will enable deep investigation of the data in case it is required for the development of new models, such as BI reports, etc.
Q17. What is the minimum data input for the optimal control module?
For this use case, the variables required by the optimal control module are the ones shown in Figure 5. The minimum data input for the optimal control is 200 past values of the 10-second averages of these parameters (one value every 10 seconds over the last 2,000 seconds, i.e., 33 minutes and 20 seconds).
Q18. What is the name of the corresponding topic?
The input messages are processed at the Kafka server using a specific topic that is known by both sides (the MefosService and the MefosSpark) and requires a suitable configuration, e.g. "ToSpark".