
Showing papers by "Per Stenström published in 2016"


Patent
20 May 2016
TL;DR: In this patent, the best-suited compression method and device is selected from two or more combined compression methods and devices, using as the main selection criterion the dominating data type in a data block, predicted from the data types within that block.
Abstract: Methods, devices and systems enhance compression and decompression of data blocks of data values by selecting the best-suited compression method and device among two or more compression methods and devices that are combined together, each of which compresses data values of particular data types effectively. The best-suited compression method and device is selected using, as the main selection criterion, the dominating data type in a data block, predicted from the data types within said data block.
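The selection idea in the abstract can be sketched as follows. The crude type predictor and the per-type toy compressors below are illustrative assumptions, not the patent's actual mechanisms:

```python
# Hypothetical sketch: pick, among several compressors, the one suited to
# the data type predicted to dominate a block. Classification thresholds
# and compressor choices are assumptions for illustration only.

def predict_type(value):
    """Crudely classify a word as 'zero', 'small-int' or 'other'."""
    if value == 0:
        return "zero"
    if value < 2**8:
        return "small-int"
    return "other"

def dominant_type(block):
    """Predict the dominating data type in a block of words."""
    counts = {}
    for v in block:
        t = predict_type(v)
        counts[t] = counts.get(t, 0) + 1
    return max(counts, key=counts.get)

# One toy compressor per data type; a real design would plug in e.g.
# zero-run or narrow-value compressors here. The narrow compressor
# truncates out-of-range minority values in this toy model.
COMPRESSORS = {
    "zero":      lambda block: ("rle", len(block)),
    "small-int": lambda block: ("narrow", bytes(v & 0xFF for v in block)),
    "other":     lambda block: ("raw", list(block)),
}

def compress_block(block):
    t = dominant_type(block)
    return t, COMPRESSORS[t](block)
```

A block dominated by zeros is handed to the run-length scheme, a block of mostly narrow integers to the byte-packing scheme, and so on.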

25 citations


Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper proposes RADAR, a hybrid static/dynamic dead-block management technique that can accurately predict and evict dead blocks in LLCs, and considers three RADAR schemes to predict dead regions: a scheme that uses control-flow information provided by the programming model (Look-ahead), a history-based scheme (Look-back) and a combined scheme (Look-ahead and Look-back).
Abstract: Last-level caches (LLCs) bridge the processor/memory speed gap and reduce energy consumed per access. Unfortunately, LLCs are poorly utilized because of the relatively large occurrence of dead blocks. We propose RADAR, a hybrid static/dynamic dead-block management technique that can accurately predict and evict dead blocks in LLCs. RADAR does dead-block prediction and eviction at the granularity of address regions supported in many of today's task-parallel programming models. The runtime system utilizes static control-flow information about future region accesses in conjunction with past region access patterns to make accurate predictions about dead regions. The runtime system instructs the cache to demote and eventually evict blocks belonging to such dead regions. This paper considers three RADAR schemes to predict dead regions: a scheme that uses control-flow information provided by the programming model (Look-ahead), a history-based scheme (Look-back) and a combined scheme (Look-ahead and Look-back). Our evaluation shows that, on average, all RADAR schemes outperform state-of-the-art hardware dead-block prediction techniques, whereas the combined scheme always performs best.
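The combined scheme described above can be modelled roughly as follows; the class interface and the "dead if accessed before but not in the known future" rule are simplifying assumptions, not RADAR's exact predictor:

```python
# Illustrative sketch of a RADAR-style combined predictor: look-ahead
# uses the runtime's knowledge of future region accesses (from the
# task-parallel programming model), look-back uses past access history.
# A region with history but no known future access is considered dead.

class Radar:
    def __init__(self):
        self.history = {}     # region -> past access count (look-back)
        self.future = set()   # regions with known future accesses (look-ahead)

    def note_future_access(self, region):
        """Record static control-flow info about an upcoming region access."""
        self.future.add(region)

    def record_access(self, region):
        """Record an observed access for the history-based scheme."""
        self.history[region] = self.history.get(region, 0) + 1

    def dead_regions(self):
        """Regions accessed in the past with no known future access."""
        return {r for r in self.history if r not in self.future}

def demote(cache, region):
    """Model 'demote and eventually evict': drop the region's blocks.
    Cache keys are assumed to be (region, block_offset) pairs."""
    return {blk: data for blk, data in cache.items() if blk[0] != region}
```

The runtime would periodically call dead_regions() and instruct the cache to demote the blocks of each region returned.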

20 citations


Proceedings ArticleDOI
14 Mar 2016
TL;DR: This paper provides a snapshot summary of trends in micro-server development and their application in the broader enterprise and cloud markets, focusing on the differentiation and uniqueness of the approach adopted by the EUROSERVER FP7 project.
Abstract: This paper provides a snapshot summary of the trends in the area of micro-server development and their application in the broader enterprise and cloud markets. Focusing on the technology aspects, we provide an understanding of these trends and specifically the differentiation and uniqueness of the approach being adopted by the EUROSERVER FP7 project. The unique technical contributions of EUROSERVER range from the fundamental system compute unit design architecture, through to the implementation approach both at the chiplet nanotechnological integration, and the everything-close physical form factor. Furthermore, we offer optimizations at the virtualisation layer to exploit the unique hardware features, and other framework optimizations, including exploiting the hardware capabilities at the run-time system and application layers.

20 citations


Proceedings ArticleDOI
01 Nov 2016
TL;DR: A new timing-anomaly-free dynamic scheduling algorithm, called Out-of-(priority)-order Lazy (O-Lazy), which offers a safe and tighter estimation of the makespan of parallel applications.
Abstract: Multicore architectures can provide high predictable performance through parallel processing. Unfortunately, computing the makespan of parallel applications is overly pessimistic, either due to load-imbalance issues plaguing static scheduling methods or due to timing anomalies plaguing dynamic scheduling methods. This paper contributes an anomaly-free dynamic scheduling method, called Lazy, which is non-preemptive and non-greedy in the sense that some ready tasks may not be dispatched for execution even if some processors are idle. Assuming parallel applications using contemporary task-based parallel programming models, such as OpenMP, the general idea of Lazy is to avoid timing anomalies by assigning fixed priorities to the tasks and then dispatching selected highest-priority ready tasks for execution at each scheduling point. We formally prove that Lazy is timing-anomaly free. Unlike commonly used dynamic schedulers such as breadth-first and depth-first schedulers (e.g., CilkPlus), which rely on analytical approaches to determine an upper bound on the makespan of a parallel application, a safe makespan of a parallel application is computed by simulating Lazy. Our experimental results show that the makespan computed by simulating Lazy is much tighter and scales better, as demonstrated by four parallel benchmarks from a task-parallel benchmark suite in comparison to the state-of-the-art.
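The non-greedy dispatch idea can be illustrated with a toy scheduling-point function; the exact withholding rule below (never overtake a higher-priority task that is not yet ready) is an assumption for illustration, not the paper's proven policy:

```python
# Toy model of a Lazy-style scheduling point: tasks carry fixed
# priorities, ready tasks are dispatched strictly in priority order,
# and a processor may be left idle rather than run a task "ahead of
# turn" -- the non-greedy behaviour described in the abstract.

def lazy_dispatch(ready, free_procs, not_ready):
    """ready: list of (priority, name) pairs, lower number = higher
    priority; not_ready: priorities of tasks that exist but cannot run
    yet. Returns the tasks to dispatch at this scheduling point."""
    dispatched = []
    for prio, name in sorted(ready):
        if free_procs == 0:
            break
        # Assumed non-greedy rule: never overtake a higher-priority task
        # that is not ready yet -- leave the processor idle instead.
        if any(p < prio for p in not_ready):
            break
        dispatched.append((prio, name))
        free_procs -= 1
    return dispatched
```

Leaving processors idle in this controlled way is what makes the simulated schedule's makespan a safe bound: no execution can reorder tasks in a way the simulation did not account for.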

10 citations


Journal ArticleDOI
TL;DR: This paper develops a dynamic prefetching tuning scheme, named prefetch automatic tuner (PATer), which uses a prediction model based on machine learning to dynamically tune the prefetch configuration based on the values of hardware performance monitoring counters (PMCs).
Abstract: Hardware prefetching on IBM's latest POWER8 processor is able to improve the performance of many applications significantly, but it can also cause performance loss for others. The IBM POWER8 processor provides one of the most sophisticated hardware prefetching designs, supporting 225 different configurations. It is therefore a significant challenge to find the optimal or near-optimal hardware prefetching configuration for a specific application. In this paper, we present a dynamic prefetch tuning scheme named Prefetch Automatic Tuner (PATer). PATer uses a prediction model based on machine learning to dynamically tune the prefetch configuration based on the values of hardware performance monitoring counters (PMCs). By developing a two-phase prefetching selection algorithm and a prediction accuracy optimization algorithm in this tool, we identify a set of key hardware prefetch configurations that matter most to performance, as well as a set of PMCs that maximize the machine learning prediction accuracy. We show that PATer is able to accelerate the execution of diverse workloads by up to 1.4×.
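The core prediction step, mapping PMC readings to a prefetch configuration, can be sketched with a deliberately simple learner; the 1-nearest-neighbour model and the feature/label names below are assumptions, not PATer's actual model:

```python
# Minimal sketch of the PATer idea: learn a mapping from PMC readings
# to the best-performing prefetch configuration. Here a 1-nearest-
# neighbour predictor stands in for the real machine-learning model.

def predict_config(pmcs, training):
    """pmcs: tuple of PMC values observed for the running workload.
    training: list of (pmc_vector, best_config) pairs collected offline.
    Returns the configuration of the nearest training sample."""
    def dist(a, b):
        # Squared Euclidean distance over the selected PMC features.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda sample: dist(sample[0], pmcs))[1]
```

A runtime tuner would periodically sample the PMCs, call a predictor like this, and write the chosen configuration into the prefetcher's control register.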

10 citations


Patent
20 May 2016
TL;DR: In this patent, a plurality of semantically meaningful data fields of floating-point numbers is used to accelerate compression and decompression of data values, where multiple compressors and decompressors can be used in parallel.
Abstract: Methods, devices and systems enhance compression and decompression of data values when the values comprise a plurality of semantically meaningful data fields. According to a first inventive concept of the present invention disclosure, compression is not applied to each data value as a whole, but instead to at least one of the semantically meaningful data fields of each data value, in isolation from the others. A second inventive concept organizes the data fields that share the same semantic meaning together to accelerate compression and decompression, as multiple compressors and decompressors can be used in parallel. A third inventive concept is a system where methods and devices are tailored to perform compression and decompression of the semantically meaningful data fields of floating-point numbers after first further partitioning at least one of said data fields into two or more sub-fields, to increase the degree of value locality and improve the compressibility of floating-point values.
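For floating-point values, the semantically meaningful fields are fixed by the IEEE 754 layout, so the field split and the "group same-meaning fields together" step can be sketched directly; the grouping strategy below is an illustrative assumption:

```python
import struct

# Sketch of field-wise treatment of doubles: split each value into its
# sign / exponent / mantissa fields (IEEE 754 binary64 layout) and store
# same-meaning fields together, so each stream can be handled by its own
# compressor in parallel. Exponents of similar-magnitude values cluster,
# which is the value locality the patent abstract refers to.

def split_fields(x):
    """Return the (sign, exponent, mantissa) fields of a double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return sign, exponent, mantissa

def transpose(values):
    """Organize fields that share the same semantic meaning together."""
    signs, exps, mants = zip(*(split_fields(v) for v in values))
    return list(signs), list(exps), list(mants)
```

On typical numeric data the exponent stream is highly repetitive and compresses well on its own, while the mantissa stream can be further split into sub-fields as the third inventive concept describes.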

5 citations


Proceedings ArticleDOI
03 Oct 2016
TL;DR: Adaptive row addressing that comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row- address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers is contributed.
Abstract: Modern commercial workloads drive a continuous demand for larger and still low-latency main memories. JEDEC member companies indicate that parallel memory protocols will remain key to such memories, though widening the bus (increasing the pin count) to address larger capacities would cause multiple issues, ultimately reducing the speed (the peak data rate) and cost-efficiency of the protocols. Thus, to stay high-speed and cost-efficient, parallel memory protocols should address larger capacities using the available number of pins. This is accomplished by multiplexing the pins to transfer each address in multiple bus cycles, implementing Multi-Cycle Addressing (MCA). However, additional address-transfer cycles can significantly worsen performance and energy efficiency. This paper contributes the concept of adaptive row addressing, which comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row-address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers. For a case-study MCA protocol, the paper shows that the proposed concept improves: i) the read latency by 7.5% on average and up to 12.5%, and ii) the system-level performance and energy efficiency by 5.5% on average and up to 6.5%. This way, adaptive row addressing makes the MCA protocol as efficient as an idealized protocol of the same speed but with enough pins to transfer each row address in a single bus cycle.
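The row-address caching component can be modelled with a small LRU structure; the cache size and the hit/miss accounting below are assumptions for illustration, not the paper's evaluated design:

```python
from collections import OrderedDict

# Toy model of row-address caching for a multi-cycle-addressing (MCA)
# bus: if an access's row address hits a small cache of recently used
# rows, only a short cache index needs to be transferred, skipping the
# extra address-transfer cycle(s) that MCA would otherwise require.

class RowAddressCache:
    def __init__(self, entries=8):
        self.entries = entries
        self.cache = OrderedDict()   # row address -> True, in LRU order

    def access(self, row):
        """Return True on a hit (address cycles saved), False on a miss."""
        if row in self.cache:
            self.cache.move_to_end(row)   # refresh LRU position
            return True
        self.cache[row] = True
        if len(self.cache) > self.entries:
            self.cache.popitem(last=False)   # evict least recently used
        return False
```

Row-address prefetching would warm this cache with rows the scheduler expects to open next, and the adaptive priority policy would favour requests whose row addresses currently hit.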

1 citation


Journal ArticleDOI
TL;DR: This column discusses the career and work of Christos Kozyrakis, the 2015 recipient of the ACM SIGARCH Maurice Wilkes award, which is given annually to an early-career researcher for an outstanding contribution to computer architecture.
Abstract: This column discusses the career and work of Christos Kozyrakis, the 2015 recipient of the ACM SIGARCH Maurice Wilkes award, which is given annually to an early-career researcher for an outstanding contribution to computer architecture.

01 Jan 2016
TL;DR: This paper proposes, for the first time, a framework to estimate the WCET of dynamically scheduled parallel applications using a directed acyclic graph (DAG), and an anomaly-free priority-based new scheduling policy (called the Lazy-BFS scheduler) is proposed to safely estimate the WCET.
Abstract: Estimating a safe and tight upper bound on the Worst-Case Execution Time (WCET) of a parallel program is a major challenge for the design of real-time systems. This paper proposes, for the first time, a framework to estimate the WCET of dynamically scheduled parallel applications. Assuming that the WCET can be safely estimated for a sequential task on a multicore system, we model a parallel application using a directed acyclic graph (DAG). The execution time of the entire application is computed using a breadth-first scheduler that simulates non-preemptive execution of the nodes of the DAG (called the BFS scheduler). Experiments using the Fibonacci application from the Barcelona OpenMP Task Suite (BOTS) show that timing anomalies are a major obstacle to safely estimating the WCET of parallel applications. To avoid such anomalies, the estimated execution time of an application computed under simulation of the BFS scheduler is multiplied by a constant factor to derive a safe bound on the WCET. Finally, a new anomaly-free priority-based scheduling policy (called the Lazy-BFS scheduler) is proposed to safely estimate the WCET. Experimental results show that the bound on the WCET computed using Lazy-BFS is not only safe but also 30% tighter than that computed for BFS.
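The "compute the makespan by simulating a non-preemptive scheduler over the DAG" step can be sketched as follows; the plain FIFO ready list stands in for the paper's BFS/Lazy-BFS policies, and the DAG encoding is an assumption:

```python
import heapq

# Sketch of makespan estimation by simulating non-preemptive execution
# of a task DAG: each node has a per-task WCET, and a task becomes ready
# when all of its predecessors have finished. The dispatch order here is
# a simple FIFO stand-in for the paper's scheduling policies.

def simulate_makespan(wcet, preds, procs):
    """wcet: {task: execution time}; preds: {task: set of predecessors};
    procs: number of cores. Returns the simulated completion time."""
    indeg = {t: len(preds.get(t, ())) for t in wcet}
    ready = [t for t in wcet if indeg[t] == 0]
    running = []                      # min-heap of (finish_time, task)
    time = 0
    while ready or running:
        # Fill idle processors with ready tasks (non-preemptive dispatch).
        while ready and len(running) < procs:
            t = ready.pop(0)
            heapq.heappush(running, (time + wcet[t], t))
        # Advance time to the next task completion.
        time, done = heapq.heappop(running)
        for succ, ps in preds.items():
            if done in ps:
                indeg[succ] -= 1
                if indeg[succ] == 0:
                    ready.append(succ)
    return time
```

For a fork-join DAG where tasks a (WCET 2) and b (WCET 3) both precede c (WCET 1), the simulation yields 4 time units on two cores and 6 on one, matching the hand computation.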