Journal ArticleDOI

High-performance modelling and simulation for big data applications

TL;DR: This open access book is the final compendium of case studies emanating from the 4-year COST Action IC1406 “High-Performance Modelling and Simulation for Big Data Applications” (cHiPSet), set to become a required reference for the fast-changing fields of HPC, Big Data, and Modelling & Simulation.
About: This article is published in Simulation Modelling Practice and Theory. The article was published on 2017-08-01 and is currently open access. It has received 24 citations to date. The article focuses on the topics: Chipset & Big data.

Summary (6 min read)

1 Introduction

  • This chapter presents a position survey on the overall objective and specific challenges encompassing the state of the art in forecasting cryptocurrency value by Sentiment Analysis.
  • Further possibilities are then explored, based on this new metric perspective, such as technical analysis, forecasting, and beyond.
  • While High-Performance Computing (HPC) and Cloud Computing are not sine qua non for cryptocurrencies, their use has become pervasive in their transaction verification (“mining”).
  • Then, the Conclusion section summarizes the surveyed perspectives.


  • A central challenge is turning semi-structured data first into valuable information and then into meaningful knowledge.
  • The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications facilitates cross-pollination between the HPC community (both developers and users) and M&S disciplines for which the use of HPC facilities, technologies and methodologies is still a novel phenomenon, if present at all.
  • Such applications often require a significant amount of computational resources, with data sets scattered across multiple sources and different geographical locations.
  • Modelling has traditionally addressed complexity by raising the level of abstraction and aiming at an essential representation of the domain at hand.
  • Domain-specific considerations may put more, or even almost all, emphasis on other factors, such as usability, productivity, economic cost and time to solution.


  • Following this introductory part, the authors have a closer look at the subjects relevant to the four working groups that make up the COST Action IC1406.
  • The authors focus on Enabling Infrastructures and Middleware for Big-Data Modelling and Simulation in Sect. 3, Parallel Programming Models for Big-Data Modelling and Simulation in Sect. 4, HPC-enabled Modelling and Simulation for Life Sciences in Sect. 5, HPC-enabled Modelling and Simulation for Socio-Economical and Physical Sciences in Sect. 6, respectively.
  • Last, but not least, the authors draw some conclusions in Sect. 7.

2 Background and State of the Art

  • High-Performance Computing is currently undergoing a major change with exascale systems expected for the early 2020s.
  • Data-intensive (big data) HPC is arguably fundamental to address grand-challenge M&S problems.
  • The development of new complex HPC-enabled M&S applications requires collaborative efforts from researchers with different domain knowledge and expertise.
  • In bio-medical studies, wet-lab validation typically involves additional resource-intensive work that has to be geared towards a statistically significant distilled fragment of the computational results, suitable to confirm the bio-medical hypotheses and compatible with the available resources.
  • Big data is an emerging paradigm for data sets whose size and features are beyond the ability of current M&S tools [6].


  • M&S communities often lack suitable skills for the parallel implementation of data-intensive applications.
  • Therefore, another natural objective of their work is to intelligently transfer the heterogeneous workflows in M&S to HPC, which will boost those scientific fields that are essential for both the M&S and HPC communities [7].
  • M&S experts are to be supported in their investigations by properly-enabled HPC frameworks, which are currently sought but missing.
  • HPC architects in turn obtain access to a wealth of application domains by means of which they will better understand the specific requirements of HPC in the big data era.
  • Among others, the authors aim at the design of improved data-center oriented programming models and frameworks for HPC-enabled M&S.

3 Enabling Infrastructures and Middleware for Big-Data Modelling and Simulation

  • From the inception of the Internet, one has witnessed an explosive growth in the volume, speed and variety of electronic data created on a daily basis.
  • The so-called big data problem requires the continuous improvement of servers, storage, and the whole network infrastructure in order to enable the efficient analysis and interpretation of data through on-hand data management applications, e.g. agent-based solutions in Agent Component in Oracle Data Integrator (ODI).
  • A survey of software tools for supporting cluster, grid and cloud computing is provided in [15,17,18].
  • Job scheduling, load balancing and management play a crucial role in HPC and big data simulation [27,28].
  • Some of the best-known tools include Spark, Pig, Hive, JAQL, Sqoop, Oozie and Mahout. Apache Spark [33], a unified engine for big data processing, provides an alternative to MapReduce that enables workloads to execute in memory instead of on disk.
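As a minimal illustration of the in-memory execution model (a sketch only: it assumes a local PySpark installation, and the data set and loop are invented for illustration), the snippet below caches a dataset once and reuses it across several passes, whereas a disk-based MapReduce job would re-read its input on every pass:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "in-memory-demo")

# Cache the dataset in memory so that repeated passes avoid re-reading it.
data = sc.parallelize(range(1_000_000)).cache()

for _ in range(10):                       # an iterative workload reusing cached data
    total = data.map(lambda x: x * 2).sum()

print(total)
sc.stop()
```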


  • Apache Storm [34] is a scalable, fast, fault-tolerant platform for distributed computing that has the advantage of handling real-time data streams from synchronous and asynchronous systems.
  • Numerous tools for big data analysis, visualisation and machine learning have been made available.
  • New software applications have been developed for browsing, visualizing, interpreting and analyzing large-scale sequencing data.
  • Synchronous and asynchronous distributed simulation has been one of the options that can improve the scalability of a simulator, both in terms of application size and execution speed, enabling large-scale systems to be simulated in real time [43,44].
  • JADE [52] is a heterogeneous multiprocessor design simulation environment that allows networks-on-chip, inter-chip networks and intra-rack networks to be simulated using optical and electrical interconnects.

4 Parallel Programming Models for Big-Data Modelling and Simulation

  • A core challenge in modelling and simulation is the need to combine software expertise and domain expertise.
  • Even starting from well-defined mathematical models, manual coding is inevitable.
  • This may impair time-to-solution, performance, and performance portability across different platforms.
  • In the domain-specific language (DSL) approach, abstractions aim to provide domain experts with programming primitives that match specific concepts in their domain.

4.1 Languages and Frameworks for Big Data Analysis

  • Boosted by the popularity of big data, new languages and frameworks for data analytics are appearing at an increasing pace.
  • Each of them introduces its own concepts and terminology and advocates a (real or alleged) superiority in terms of performance or expressiveness against its predecessors.
  • For a user approaching big data analytics (even an educated computer scientist) it is increasingly difficult to retain a clear picture of the programming model underneath these tools and the expressiveness they provide to solve some user-defined problem.

  • To provide some order in the world of big data processing, a toolkit of models to identify their common features is introduced, starting from the data layout.
  • Data-processing applications are divided into batch vs. stream processing.
  • For a complete description of the Dataflow model the authors refer back to [6,70], where the main features of mainstream languages are presented.
  • Based on the map and reduce functions, commonly used in parallel and functional programming [73], MapReduce provides a native key-value model and built-in sorting facilities.
  • Each flat-map executor emits R (i.e. the number of intermediate partitions) chunks, each containing the intermediate key-value pairs mapped to a given partition.
  • Each reduce executor then performs the reduction on a per-key basis (a minimal sketch of these phases is given below).
  • Finally, a downstream collector gathers R tokens from the reduce executors and merges them into the final result.
  • This poses severe challenges from the implementation perspective.
  • As a key feature, HDFS exposes the locality of stored data, enabling the principle of moving the computation towards the data in order to minimise communication.
  • Disk-based communication leads to performance problems when dealing with iterative computations, such as machine learning algorithms [74].
  • Instead of a fixed processing schema, Spark allows datasets to be processed by means of arbitrarily composed primitives, constructing a directed acyclic graph (DAG).
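The following self-contained Python sketch mimics the phases described above (flat-map emission into R partitions, per-key reduction, and a final merge by a downstream collector) for a toy word count; it illustrates the model, not the Hadoop implementation:

```python
from collections import defaultdict

R = 3  # number of intermediate partitions

def flat_map(chunk):
    """Flat-map phase: emit R chunks of intermediate key-value pairs."""
    partitions = [[] for _ in range(R)]
    for word in chunk.split():
        partitions[hash(word) % R].append((word, 1))
    return partitions

def reduce_partition(pairs):
    """Reduce phase: reduction on a per-key basis within one partition."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

chunks = ["to be or not to be", "to see or not to see"]
emitted = [flat_map(c) for c in chunks]                          # map side
shuffled = [sum((e[r] for e in emitted), []) for r in range(R)]  # shuffle by partition
reduced = [reduce_partition(p) for p in shuffled]                # reduce side
result = {k: v for token in reduced for k, v in token.items()}   # collector merges R tokens
print(result)
```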


  • Similar to the MapReduce implementation, Spark’s execution model relies on the master-workers model: a cluster manager (e.g. YARN) manages resources and supervises the execution of the program.
  • Each of these actors represents independent data-parallel tasks, on which pipeline parallelism is exploited.
  • Currently, they include, among others, Apache Flink, Apache Spark and Google Cloud Dataflow.
  • Bounded PCollections can be processed using batch jobs, which read the entire data set once and perform processing as a finite job.
  • That graph is then executed using the appropriate distributed processing back-end, becoming an asynchronous job/process on that back-end.
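As a sketch of this execution model, the snippet below builds a small graph of transforms over a bounded PCollection and hands it to a back-end runner as a finite batch job; the choice of the Apache Beam Python SDK and the local DirectRunner is an assumption for illustration, since the survey names the Dataflow model and its back-ends rather than a specific SDK:

```python
import apache_beam as beam

# Building the pipeline only constructs the graph; the runner executes it
# when the context exits (here the local DirectRunner; Flink, Spark or
# Cloud Dataflow runners could execute the same graph).
with beam.Pipeline() as pipeline:
    (pipeline
     | "Create" >> beam.Create(["to be or not to be"])   # bounded PCollection
     | "Split"  >> beam.FlatMap(str.split)
     | "Pair"   >> beam.Map(lambda word: (word, 1))
     | "Count"  >> beam.CombinePerKey(sum)
     | "Print"  >> beam.Map(print))
```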

4.2 The Systematic Mapping Study on Parallel Programming Models for Big-Data Modelling and Simulation

  • In order to minimize the bias, given that many Action participants actively design programming models and tools, the working group refined and adopted a systematic methodology to study the state of the art, called systematic mapping study (SMS).
  • The mapping study focused on the main paradigms and properties of programming languages used in high-performance computing for big data processing.


  • Specifically, the SMS focused on domain-specific languages and explicitly excluded general-purpose languages, such as C, C++, OpenMP, Fortran, Java, Python, Scala, etc., combined with parallel exploitation libraries, such as MPI.
  • Quantitatively, in the SMS, the initial literature search resulted in 420 articles; 152 articles were retained for final review after the evaluation of initial search results by domain experts.
  • Results of the mapping study indicate, for instance, that the majority of HPC languages used in the context of big data are text-based general-purpose programming languages targeting the end-user community.
  • To evaluate the outcome of the mapping study, the authors developed a questionnaire and collected the opinions of domain experts.

5 HPC-Enabled Modelling and Simulation for Life Sciences

  • Life Sciences typically deal with and generate large amounts of data, e.g., the flux of terabytes about genes and their expression produced by state of the art sequencing and microarray equipment, or data relating to the dynamics of cell biochemistry or organ functionality.
  • The authors will consider approaches for modelling healthcare and diseases as well as problems in systems and synthetic biology.
  • Taking into account only DNA sequencing data, its rate of accumulation is much larger than that of other major generators of big data, such as astronomy, YouTube and Twitter.
  • Areas such as systems medicine, clinical informatics, systems biology and bioinformatics have large overlaps with classical fields of medicine, and extensively use biological information and computational methods to infer new knowledge towards understanding disease mechanism and diagnosis.
  • A patient’s condition is characterised by multiple, complex and interrelated conditions, disorders or diseases [87,88].


  • The medical approach to comorbidities represents an impressive computational challenge: it requires the integration of heterogeneous sources of information, the definition of deep phenotyping and marker re-modulation, and the establishment of clinical decision support systems.
  • This could be of great importance for epigenetic data, which shows alteration with ageing, inflammatory diseases, obesity, cardiovascular and neurodegenerative diseases.
  • Depending on the magnitude of mechanical stress, osteoprogenitors differentiate or transdifferentiate into osteoblast-like cells that express characteristic proteins and can form bone matrix.
  • The transition between a continuous representation and a discrete representation makes the coupling of the models across the cell-tissue scale particularly difficult.
  • Conventional homogenisation approaches, frequently used as relation models to link to component models defined at different scales, are computationally resource demanding [89–92].


  • In recent years, thanks to faster and cheaper sequencing machines, a huge amount of whole genomic sequences within the same population has become available (e.g. [99]).
  • An elastic-degenerate text (ED-text) is a sequence compactly representing a multiple alignment of several closely-related sequences: substrings that match exactly are collapsed, while those in positions where the sequences differ (by means of substitutions, insertions, and deletions of substrings) are called degenerate, and therein all possible variants observed at that location are listed [105].
  • This problem has been efficiently solved in [112] with a linear time algorithm for the case of non-elastic D-texts (a degenerate segment can only contain strings of the same size).
  • Solutions typically have exponential computational complexity.
  • WHATSHAP [113] is a framework returning exact solutions to the haplotyping problem; it moves the computational complexity from DNA fragment length to fragment overlap (i.e., coverage) and is hence of particular interest given current sequencing-technology trends towards longer fragments.
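To make the ED-text representation concrete, here is a toy Python illustration (the segments and variants are invented): solid segments hold a single string, degenerate segments list the variants observed at that position, and an empty string encodes a deletion.

```python
from itertools import product

# Toy elastic-degenerate text: solid segments (one variant) alternate with
# degenerate segments (several variants; '' encodes a deletion).
ed_text = [["AC"], ["G", "T", ""], ["CA"], ["T", "TT"]]

def expansions(ed):
    """Enumerate every plain sequence the ED-text represents (toy sizes only)."""
    return sorted("".join(parts) for parts in product(*ed))

print(expansions(ed_text))
# ['ACCAT', 'ACCATT', 'ACGCAT', 'ACGCATT', 'ACTCAT', 'ACTCATT']
```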


  • Many functional modules are linked together in a Metabolic Network for reproducing metabolic pathways and describing the entire cellular metabolism of an organism.
  • An integrated approach based on statistical, topological, and functional analysis allows for obtaining a deep knowledge on overall metabolic network robustness.
  • So, ultra-peripheral non-hub nodes can assume a fundamental role for network survival if they belong to network extreme pathways, while hub nodes can have a limited impact on networks if they can be replaced by alternative nodes and paths [115,116].
  • The same approach has been applied as a bio-inspired optimisation method to different application domains.
  • The computational analysis of complex biological systems can be hindered by three main factors:

1. modelling the system so that it can be easily understood and analysed by non-expert users is not always possible;
2. when the system is composed of hundreds or thousands of reactions and chemical species, classic CPU-based simulators may not be appropriate to efficiently derive the behaviour of the system;
3. these methods often need an amount of experimental data that is not always available.

  • The system behaviour is described in detail by a system of ordinary differential equations (ODEs), while model indetermination is resolved by selecting time-varying coefficients that maximize/minimize the objective function at each ODE integration step (a sketch is given below).
  • Some interesting applications in this context are based on the study of integrated biological data and how they are organised in complex systems.
  • The persistent challenges in the healthcare sector call for urgent review of strategies.
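The sketch below illustrates the coefficient-selection idea on a hypothetical toy system (the model, candidate coefficients and objective are invented, and SciPy's solve_ivp is assumed to be available): at every integration step the undetermined coefficient is chosen from a candidate set so as to maximize the objective, here simply the state value at the end of the step.

```python
from scipy.integrate import solve_ivp

# Toy system dx/dt = -k(t) * x with an undetermined, time-varying coefficient k.
candidates = [0.1, 0.5, 1.0]
x, t, dt = 1.0, 0.0, 0.1

def step(x0, t0, k):
    """Integrate one step of length dt with coefficient k and return the final state."""
    sol = solve_ivp(lambda s, y: -k * y, (t0, t0 + dt), [x0])
    return sol.y[0, -1]

trajectory = []
for _ in range(20):
    k_best = max(candidates, key=lambda k: step(x, t, k))  # objective: maximize x
    x, t = step(x, t, k_best), t + dt
    trajectory.append((round(t, 2), k_best, round(x, 4)))

print(trajectory[:3])
```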


  • There has also been diverse application of operations management techniques in several domains including the health sector.
  • A major classification identified resource and facility management, demand forecasting, inventory and supply chain management, and cost measurement as application groupings to prioritise [126].
  • Challenges also arise around patient workflow: admission, scheduling, and resource allocation.
  • This obviously comes with the need for adequate computing and storage capabilities.
  • The choice of model and/or simulation technique can ultimately be influenced by available computing power and storage space.
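As a toy example of how such a use-case can be simulated, the sketch below models patient flow through a single consultation room with exponential inter-arrival and service times; the rates are invented and the model is deliberately minimal, whereas real healthcare models would add admission, scheduling and resource-allocation logic.

```python
import random

random.seed(1)
ARRIVAL_MEAN, SERVICE_MEAN, N_PATIENTS = 10.0, 8.0, 1000  # minutes

t, room_free_at, waits = 0.0, 0.0, []
for _ in range(N_PATIENTS):
    t += random.expovariate(1 / ARRIVAL_MEAN)        # next patient arrives
    start = max(t, room_free_at)                     # wait if the room is busy
    waits.append(start - t)
    room_free_at = start + random.expovariate(1 / SERVICE_MEAN)

print(f"mean waiting time: {sum(waits) / len(waits):.1f} minutes")
```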

6 HPC-Enabled Modelling and Simulation for Socio-Economical and Physical Sciences

  • Many types of decisions in society are supported by modelling and simulation.
  • The authors can roughly divide the applications within the large and diverse area, which they here call socio-economical and physical sciences, into two groups.
  • In classical HPC applications, the need for HPC arises from a large-scale model or a computationally heavy software implementation that needs to make use of large-scale computational resources, and potentially also large-scale storage resources, in order to deliver timely results.
  • The opportunities for using data in new ways are endless, but as is suggested in [138], data and algorithms together can provide the whats, while the innovation and imagination of human interpreters is still needed to answer the whys.
  • Wing design is one of the essential procedures of aircraft manufacturing, and it is a compromise between many competing factors and constraints.


  • Necessary derivatives can easily be calculated by applying finite-difference methods (a minimal sketch is given after this list).
  • As a thriving application platform, HPC excels in supporting the execution of Computational Intelligence (CI) algorithms and their speedup through parallelisation.
  • The CI algorithms supported by this Action include some of the most efficient optimization algorithms for continuous optimization, as defined by the benchmark-function competition framework of the Congress on Evolutionary Computation (CEC) 2017 [143,144].
  • IoT assumes that multiple sensors can be used to monitor the real world, and this information can be stored and processed, jointly with information from soft sensors (RSS, web, etc.) [155], to, for example, assist elderly people in the street [156], develop intelligent interfaces [157] or detect anomalies in industrial environments [158].
  • Concentration of these data at a decision-making location may also allow travel time estimation, exploitation of network locality information, as well as comparison with the estimates provided by a traffic management system, which can be evaluated for effectiveness on the medium term and possibly tuned accordingly.
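A minimal sketch of the finite-difference idea referred to above; the objective function here is a made-up stand-in for an expensive aerodynamic model.

```python
def central_difference_gradient(f, x, h=1e-6):
    """Approximate the gradient of f at x (a list of floats) by central differences."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# Hypothetical smooth objective standing in for an expensive simulation.
drag = lambda v: v[0] ** 2 + 3 * v[1] ** 2
print(central_difference_gradient(drag, [1.0, 2.0]))  # approx. [2.0, 12.0]
```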


  • Extreme value estimation plays an important role in risk management, in insurance, and in the prediction of catastrophic climate events.
  • In a later chapter, methods for extreme value estimation are surveyed.

7 Summary and Conclusion

  • HPC and M&S form two previously largely disjoint and disconnected research communities.
  • The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications brings these two communities together to tackle the challenges of big data applications from diverse application domains.
  • With the scene set in this paper, the other papers of this volume exemplify the achievements of the COST Action.


Citations
01 Jan 2013
TL;DR: From the experience of several industrial trials on smart grid with communication infrastructures, it is expected that the traditional carbon fuel based power plants can cooperate with emerging distributed renewable energy such as wind, solar, etc, to reduce the carbon fuel consumption and consequent green house gas such as carbon dioxide emission.
Abstract: A communication infrastructure is an essential part to the success of the emerging smart grid. A scalable and pervasive communication infrastructure is crucial in both construction and operation of a smart grid. In this paper, we present the background and motivation of communication infrastructures in smart grid systems. We also summarize major requirements that smart grid communications must meet. From the experience of several industrial trials on smart grid with communication infrastructures, we expect that the traditional carbon fuel based power plants can cooperate with emerging distributed renewable energy such as wind, solar, etc, to reduce the carbon fuel consumption and consequent green house gas such as carbon dioxide emission. The consumers can minimize their expense on energy by adjusting their intelligent home appliance operations to avoid the peak hours and utilize the renewable energy instead. We further explore the challenges for a communication infrastructure as the part of a complex smart grid system. Since a smart grid system might have over millions of consumers and devices, the demand of its reliability and security is extremely critical. Through a communication infrastructure, a smart grid can improve power reliability and quality to eliminate electricity blackout. Security is a challenging issue since the on-going smart grid systems facing increasing vulnerabilities as more and more automation, remote monitoring/controlling and supervision entities are interconnected.

1,036 citations

Journal ArticleDOI
TL;DR: The central premise of the book is that the combination of the Pareto or Zipf distribution that is characteristic of Web traffic and the direct access to consumers via Web technology has opened up new business opportunities in the ''long tail''.
Abstract: The Long Tail: How Technology is turning mass markets into millions of niches. (p. 15). This passage from The Long Tail pretty much sums it all up. The Long Tail by Chris Anderson is a good and worthwhile read for information scientists, computer scientists, ecommerce researchers, and others interested in all areas of Web research. The central premise of the book is that the combination of (1) the Pareto or Zipf distribution (i.e., power law probability distribution) that is characteristic of Web traffic and (2) the direct access to consumers via Web technology has opened up new business opportunities in the "long tail". Producers and advertisers no longer have to target "the big hits" at the head of the distribution. Instead, they can target the small, niche communities or even individuals in the tail of the distribution. The long tail has been studied by Web researchers and has been noted in term usage on search engines, access times to servers, and popularity of Web sites. Anderson points out that the long tail also applies to products sold on the Web. He recounts that a sizeable percentage of Amazon sales come from books that only sell a few copies, a large number of songs from Rhapsody get downloaded only once in a month, and a significant number of movies from Netflix only get ordered occasionally. However, since the storage is in digital form for the songs and music (and Amazon outsources the storage of books) there is little additional inventory cost of these items. This phenomenon across all Web companies has led to a broadening of participation by both producers and consumers that would not have happened without the Web. The idea of the long tail is well known, of course. What Anderson has done is present it in an interesting manner and in a Web ecommerce setting. He applies it to Web businesses and then relates the multitude of other factors ongoing that permit the actual implementation of the long tail effect. Anderson also expands on prior work on the long tail by introducing an element of time, giving the distribution a three-dimensional effect. All in all, it is a nifty idea. The book is comprised of 14 chapters, plus an Introduction. Chapter 1 presents an overview of what the long tail is. Chapter 2 discusses the "head", which is the top of the tail where the …

827 citations

01 Jan 2016
Using MPI: Portable Parallel Programming with the Message Passing Interface.

593 citations

Journal Article
TL;DR: The reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort are described and the application requirements for consistency, availability, partition tolerance, data model and scalability are discussed.
Abstract: Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the reasons why Facebook chose Hadoop and HBase over other systems such as Apache Cassandra and Voldemort and discusses the application's requirements for consistency, availability, partition tolerance, data model and scalability. I explore the enhancements made to Hadoop to make it a more effective realtime system, the tradeoffs we made while configuring the system, and how this solution has significant advantages over the sharded MySQL database scheme used in other applications at Facebook and many other web-scale companies. I discuss the motivations behind my design choices, the challenges that we face in day-to-day operations, and future capabilities and improvements still under development. I offer these observations on the deployment as a model for other companies who are contemplating a Hadoop-based solution over traditional sharded RDBMS deployments.

279 citations

Journal ArticleDOI
TL;DR: In an out-of-sample analysis accounting for transaction cost, it is found that combining cryptocurrencies enriches the set of ‘low’-risk cryptocurrency investment opportunities and the 1/N-portfolio outperforms single cryptocurrencies and more than 75% of mean-variance optimal portfolios.
Abstract: By the end of 2017, 27 cryptocurrencies topped a market capitalization of one billion USD. Bitcoin is still shaping market and media coverage, however, recently we faced a vibrant rise of other currencies. As a result, 2017 has also witnessed the advent of a large number of cryptocurrency-funds. In this paper, we use Markowitz' mean-variance framework in order to assess risk-return-benefits of cryptocurrency-portfolios. We relate risk and return of different portfolio strategies to single cryptocurrency investments. In an out-of-sample analysis accounting for transaction cost we find that combining cryptocurrencies in a portfolio enriches the set of 'low'-risk cryptocurrency investment opportunities.

84 citations

References
Journal ArticleDOI
TL;DR: The goals of the PDB are described, the systems in place for data deposition and access, how to obtain further information and plans for the future development of the resource are described.
Abstract: The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.

34,239 citations

Journal ArticleDOI
Rainer Storn, Kenneth Price
TL;DR: In this article, a new heuristic approach for minimizing possibly nonlinear and non-differentiable continuous space functions is presented, which requires few control variables, is robust, easy to use, and lends itself very well to parallel computation.
Abstract: A new heuristic approach for minimizing possibly nonlinear and non-differentiable continuous space functions is presented. By means of an extensive testbed it is demonstrated that the new method converges faster and with more certainty than many other acclaimed global optimization methods. The new method requires few control variables, is robust, easy to use, and lends itself very well to parallel computation.

24,053 citations

Journal ArticleDOI
TL;DR: Upon returning to the U.S., author Singhal’s Google search revealed the following: in January 2001, the impeachment trial against President Estrada was halted by senators who supported him and the government fell without a shot being fired.

23,419 citations


"High-performance modelling and simu..." refers background or methods in this paper

  • ...energy applications [148], constrained trajectory planning [149], artificial life of full ecosystems [150] including HPC-enabled evolutionary computer vision in 2D [151,152] and 3D [151], many other well recognized real-world optimization challenges [153], or even insight to deep inner dynamics of DE over full benchmarks, requiring large HPC capacities [154]....


  • ...In Rogers’ classic work [150], the author defines information diffusion as the process in which an innovation is communicated through certain channels over time among the members of a social system....


  • ...Rogers’ theory [150] is quantified by the Bass model [33]....


  • ...Evaluation: [8-512] nodes cluster simulation...


Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.

20,309 citations

Journal ArticleDOI
Abstract: Social network sites (SNSs) are increasingly attracting the attention of academic and industry researchers intrigued by their affordances and reach. This special theme section of the Journal of Computer-Mediated Communication brings together scholarship on these emergent phenomena. In this introductory article, we describe features of SNSs and propose a comprehensive definition. We then present one perspective on the history of such sites, discussing key changes and developments. After briefly summarizing existing scholarship concerning SNSs, we discuss the articles in this special section and conclude with considerations for future research.

14,912 citations


"High-performance modelling and simu..." refers background or methods in this paper

  • ...The rule-of-thumb bandwidth estimator of Silverman [44],...


  • ...Evaluation: [8-512] nodes cluster simulation...


  • ...It was shown in [44], see also [41], that minimizing the square integrated error (ISE) for a specific sample is equivalent to minimizing the cross-validation function...


  • ...We then have [44] the asymptotic approximation...


  • ...Synchronous and asynchronous distributed simulation have been one of the options that could improve the scalability of a simulator both in term of application size and execution speed, enabling large scale systems to be simulated in real time [43,44]....


Frequently Asked Questions (20)
Q1. What are the contributions mentioned in the paper "High-performance modelling and simulation for big data applications" ?

In this introductory article the authors argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, the authors provide an overview of the state of the art in the various research areas concerned. 

In the following work, some more specific implementations and experimental results could be presented, based on the guidelines, outlines, and integration possibilities presented in this chapter. Author RS also acknowledges that this work was supported by the Ministry of Education, Youth and Sports of the Czech Republic within the National Sustainability Programme Project No. LO1303 (MSMT-7778/2014), and further supported by the European Regional Development Fund under the Project CEBIA-Tech no. 


By using the method of modular analysis and unified derivatives (MAUD), the authors can unify all methods for computing total derivatives using a single equation with associated distributed-memory, sparse data-passing schemes. 

The medical approach to comorbidities represents an impressive computational challenge, mainly because of data synergies leading to the integration of heterogeneous sources of information, the definition of deep phenotyping and markers re-modulation; the establishment of clinical decision support systems. 

Due to biotechnologies limitations, sequencing (that is, giving as input the in vitro DNA and getting out an in silico text file) can only be done on a genome fragment of limited size. 

In the case of the EU project RIVR (Upgrading National Research Structures in Slovenia), supported by the European Regional Development Fund (ERDF), an important side-effect of the cHiPSet COST Action was leveraging its experts' inclusiveness to gain capacity recognition at a national ministry for co-financing HPC equipment. 

In particular, complex disease management is mostly based on electronic health records collection and analysis, which are expensive processes. 

Since most of these applications belong to domains within the life, social and physical sciences, their mainstream approaches are rooted in non-computational abstractions and they are typically not HPC-enabled. 

Classical HPC applications, where the authors build a large-scale complex model and simulate it in order to produce data as a basis for decisions, and big data applications, where the starting point is a data set that is processed and analyzed to learn the behaviour of a system, to find relevant features, and to make predictions or decisions. 

For instance, by using Next Generation Sequencing technology, cancer clones, subtypes and metastases could be appropriately traced. 


The chances of making data the driver of paths to cures for many complex diseases depend to a large extent on extracting evidence from large-scale electronic record comparison and on models of disease trajectories. 

CloudSim [54] is one of the most popular open source frameworks for modeling and simulation of cloud computing infrastructures and services. 

The growth is driven by three main factors: 1. Biomedicine is heavily interdisciplinary and e-Healthcare requires physicians, bioinformaticians, computer scientists and engineers to team up. 

The optimum framework for modelling and simulating a particular use-case depends on the availability, structure and size of data [126]. 


Other tools, such as BamView [40], have been developed specifically to visualise mapped read alignment data in the context of the reference sequence. 

Some approaches have been successful, leading to potential industrial impact and supporting experiments that generate petabytes of data, like those performed at CERN for instance. 

The computational analysis of complex biological systems can be hindered by three main factors: 2. When the system is composed of hundreds or thousands of reactions and chemical species, classic CPU-based simulators may not be appropriate to efficiently derive the behaviour of the system.