Home
/
Authors
/
Jian Li

Author

Jian Li

Other affiliations: University of Rhode Island, Cornell University

Bio: Jian Li is an academic researcher from IBM. The author has contributed to research in topics: Cache & Cache coloring. The author has an hindex of 17, co-authored 51 publications receiving 2286 citations. Previous affiliations of Jian Li include University of Rhode Island & Cornell University.

Topics: Cache, Cache coloring, Cache invalidation, Cache algorithms, Cache pollution ...read more

Papers published on a yearly basis

2022
2021
2014
2013
2012
2011
2010
2009
2008
2006
2005
2004
2003
2002

Papers

PDF

Open Access

More filters

Proceedings Article•DOI•

A novel architecture of the 3D stacked MRAM L2 cache for CMPs

[...]

Guangyu Sun¹, Xiangyu Dong¹, Yuan Xie¹, Jian Li², Yi Chen³ - Show less +1 more•Institutions (3)

Pennsylvania State University¹, IBM², Seagate Technology³

06 Mar 2009

TL;DR: This paper stacks MRAM-based L2 caches directly atop CMPs and compares it against SRAM counterparts in terms of performance and energy, and proposes two architectural techniques: read-preemptive write buffer and SRAM-MRAM hybrid L2 cache.

...read moreread less

Abstract: Magnetic random access memory (MRAM) is a promising memory technology, which has fast read access, high density, and non-volatility. Using 3D heterogeneous integrations, it becomes feasible and cost-efficient to stack MRAM atop conventional chip multiprocessors (CMPs). However, one disadvantage of MRAM is its long write latency and its high write energy. In this paper, we first stackMRAM-based L2 caches directly atop CMPs and compare it against SRAM counterparts in terms of performance and energy. We observe that the direct MRAM stacking might harm the chip performance due to the aforementioned long write latency and high write energy. To solve this problem, we then propose two architectural techniques: read-preemptive write buffer and SRAM-MRAM hybrid L2 cache. The simulation result shows that our optimized MRAM L2 cache improves performance by 4.91% and reduces power by 73.5%compared to the conventional SRAM L2 cache with the similar area.

...read moreread less

459 citations

Proceedings Article•DOI•

Hybrid cache architecture with disparate memory technologies

[...]

Xiaoxia Wu¹, Jian Li², Lixin Zhang², Evan Speight², Ramakrishnan Rajamony², Yuan Xie¹ - Show less +2 more•Institutions (2)

Pennsylvania State University¹, IBM²

20 Jun 2009

TL;DR: This paper discusses and evaluates two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based H CA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology.

...read moreread less

Abstract: Caching techniques have been an efficient mechanism for mitigating the effects of the processor-memory speed gap. Traditional multi-level SRAM-based cache hierarchies, especially in the context of chip multiprocessors (CMPs), present many challenges in area requirements, core-to-cache balance, power consumption, and design complexity. New advancements in technology enable caches to be built from other technologies, such as Embedded DRAM (EDRAM), Magnetic RAM (MRAM), and Phase-change RAM (PRAM), in both 2D chips or 3D stacked chips. Caches fabricated in these technologies offer dramatically different power and performance characteristics when compared with SRAM-based caches, particularly in the areas of access latency, cell density, and overall power consumption. In this paper, we propose to take advantage of the best characteristics that each technology offers, through the use of Hybrid Cache Architecture (HCA) designs. We discuss and evaluate two types of hybrid cache architectures: inter cache Level HCA (LHCA), in which the levels in a cache hierarchy can be made of disparate memory technologies; and intra cache level or cache Region based HCA (RHCA), where a single level of cache can be partitioned into multiple regions, each of a different memory technology. We have studied a number of different HCA architectures and explored the potential of hardware support for intra-cache data movement and power consumption management within HCA caches. Utilizing a full-system simulator that has been validated against real hardware, we demonstrate that an LHCA design can provide a geometric mean 7% IPC improvement over a baseline 3-level SRAM cache design under the same area constraint across a collection of 25 workloads. A more aggressive RHCA-based design provides 12% IPC improvement over the baseline. Finally, a 2-layer 3D cache stack (3DHCA) of high density memory technology within the same chip footprint gives 18% IPC improvement over the baseline. Furthermore, up to 70% reduction in power consumption over a baseline SRAM-only design is achieved.

...read moreread less

375 citations

Proceedings Article•DOI•

Dynamic power-performance adaptation of parallel computation on chip multiprocessors

[...]

Jian Li¹, Jose F. Martinez¹•Institutions (1)

Cornell University¹

27 Feb 2006

TL;DR: This work addresses the problem of dynamically optimizing power consumption of a parallel application that executes on a many-core CMP under a given performance constraint by presenting simple, low-overhead heuristics for dynamic optimization that significantly cut down on the search effort along both dimensions of the optimization space.

...read moreread less

Abstract: Previous proposals for power-aware thread-level parallelism on chip multiprocessors (CMPs) mostly focus on multiprogrammed workloads. Nonetheless, parallel computation of a single application is critical in light of the expanding performance demands of important future workloads. This work addresses the problem of dynamically optimizing power consumption of a parallel application that executes on a many-core CMP under a given performance constraint. The optimization space is two-dimensional, allowing changes in the number of active processors and applying dynamic voltage/frequency scaling. We demonstrate that the particular optimum operating point depends nontrivially on the power-performance characteristics of the CMP, the application's behavior, and the particular performance target. We present simple, low-overhead heuristics for dynamic optimization that significantly cut down on the search effort along both dimensions of the optimization space. In our evaluation of several parallel applications with different performance targets, these heuristics quickly lock on a configuration that yields optimal power savings in virtually all cases.

...read moreread less

286 citations

Proceedings Article•DOI•

The PERCS High-Performance Interconnect

[...]

Baba Arimilli¹, Ravi Kumar Arimilli¹, Vicente Enrique Chung¹, Scott Douglas Clark¹, Wolfgang Denzel¹, Ben Drerup¹, Torsten Hoefler², Jody B. Joyner¹, Jerry Don Lewis¹, Jian Li¹, Nan Ni¹, Ramakrishnan Rajamony¹ - Show less +8 more•Institutions (2)

IBM¹, University of Illinois at Urbana–Champaign²

18 Aug 2010

TL;DR: The Blue Waters System, which is being constructed at NCSA, is an exemplar large-scale PERCS installation that is expected to deliver sustained Pet scale performance over a wide range of applications.

...read moreread less

Abstract: The PERCS system was designed by IBM in response to a DARPA challenge that called for a high-productivity high-performance computing system. A major innovation in the PERCS design is the network that is built using Hub chips that are integrated into the compute nodes. Each Hub chip is about 580 mm$^2$ in size, % uses 45 nm IBM CMOS 12S0 SOI technology with 13 levels of metal, has over 3700 signal I/Os, and is packaged in a module that also contains LGA-attached optical electronic devices. The Hub module implements five types of high-bandwidth interconnects with multiple links that are fully-connected with a high-performance internal crossbar switch. These links provide over 9 Tbits/second of raw bandwidth and are used to construct a two-level direct-connect topology spanning up to tens of thousands of \PS{} chips with high bisection bandwidth and low latency. The Blue Waters System, which is being constructed at NCSA, is an exemplar large-scale PERCS installation. Blue Waters is expected to deliver sustained Pet scale performance over a wide range of applications. The Hub chip supports several high-performance computing protocols (e.g., MPI, RDMA, IP) and also provides a non-coherent system-wide global address space. Collective communication operations such as barriers, reductions, and multi-cast are supported directly in hardware. Multiple routing modes including deterministic as well as hardware-directed random routing are also supported. Finally, the Hub module is capable of operating in the presence of many types of hardware faults and gracefully degrades performance in the presence of lane failures.

...read moreread less

212 citations

Proceedings Article•DOI•

The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors

[...]

Jian Li¹, Jose F. Martinez¹, Michael C. Huang²•Institutions (2)

Cornell University¹, University of Rochester²

14 Feb 2004

TL;DR: This work presents the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance, and leverages the coherence protocol and proposes small hardware extensions to achieve timely wake-up of dormant threads.

...read moreread less

Abstract: Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the cooperative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.

...read moreread less

165 citations

1
2
3
4
…
5
6
7
8
9
10
11

Collapse

Cited by

PDF

Open Access

More filters

Fast parallel algorithms for short-range molecular dynamics

[...]

Steven J. Plimpton¹•Institutions (1)

Sandia National Laboratories¹

01 May 1993

TL;DR: Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems.

...read moreread less

Abstract: Three parallel algorithms for classical molecular dynamics are presented. The first assigns each processor a fixed subset of atoms; the second assigns each a fixed subset of inter-atomic forces to compute; the third assigns each a fixed spatial region. The algorithms are suitable for molecular dynamics models which can be difficult to parallelize efficiently—those with short-range forces where the neighbors of each atom change rapidly. They can be implemented on any distributed-memory parallel machine which allows for message-passing of data between independently executing processors. The algorithms are tested on a standard Lennard-Jones benchmark problem for system sizes ranging from 500 to 100,000,000 atoms on several parallel supercomputers--the nCUBE 2, Intel iPSC/860 and Paragon, and Cray T3D. Comparing the results to the fastest reported vectorized Cray Y-MP and C90 algorithm shows that the current generation of parallel machines is competitive with conventional vector supercomputers even for small problems. For large problems, the spatial algorithm achieves parallel efficiencies of 90% and a 1840-node Intel Paragon performs up to 165 faster than a single Cray C9O processor. Trade-offs between the three algorithms and guidelines for adapting them to more complex molecular dynamics simulations are also discussed.

...read moreread less

29,323 citations

DOI•

International Technology Roadmap for Semiconductors 2003の要求清浄度について－シリコンウエハ表面と雰囲気環境に要求される清浄度, 分析方法の現状について－

[...]

飯田裕幸, 竹田菊男, 藤本武利

20 Sep 2004

1,387 citations

Journal Article•DOI•

Trends in big data analytics

[...]

Karthik Kambatla¹, Giorgios Kollias², Vipin Kumar³, Ananth Grama¹•Institutions (3)

Purdue University¹, IBM², University of Minnesota³

01 Jul 2014-Journal of Parallel and Distributed Computing

TL;DR: An overview of the state-of-the-art and focus on emerging trends to highlight the hardware, software, and application landscape of big-data analytics are provided.

...read moreread less

699 citations

Proceedings Article•DOI•

An Analysis of Efficient Multi-Core Global Power Management Policies: Maximizing Performance for a Given Power Budget

[...]

Canturk Isci¹, Alper Buyuktosunoglu¹, C.-Y. Chen¹, Pradip Bose¹, Margaret Martonosi² - Show less +1 more•Institutions (2)

IBM¹, Princeton University²

09 Dec 2006

TL;DR: The results show that the best architected policies can come within 1% of the performance of an ideal oracle, while meeting a given chip-level power budget, and are significantly better than static management, even if static scheduling is given oracular knowledge.

...read moreread less

Abstract: Chip-level power and thermal implications will continue to rule as one of the primary design constraints and performance limiters. The gap between average and peak power actually widens with increased levels of core integration. As such, if per-core control of power levels (modes) is possible, a global power manager should be able to dynamically set the modes suitably. This would be done in tune with the workload characteristics, in order to always maintain a chip-level power that is below the specified budget. Furthermore, this should be possible without significant degradation of chip-level throughput performance. We analyze and validate this concept in detail in this paper. We assume a per-core DVFS (dynamic voltage and frequency scaling) knob to be available to such a conceptual global power manager. We evaluate several different policies for global multi-core power management. In this analysis, we consider various different objectives such as prioritization and optimized throughput. Overall, our results show that in the context of a workload comprised of SPEC benchmark threads, our best architected policies can come within 1% of the performance of an ideal oracle, while meeting a given chip-level power budget. Furthermore, we show that these global dynamic management policies perform significantly better than static management, even if static scheduling is given oracular knowledge.

...read moreread less

667 citations

Proceedings Article•DOI•

An integrated GPU power and performance model

[...]

Sunpyo Hong¹, Hyesoon Kim¹•Institutions (1)

Georgia Institute of Technology¹

19 Jun 2010

TL;DR: An integrated power and performance prediction model for a GPU architecture to predict the optimal number of active processors for a given application and the outcome of IPP is used to control the number of running cores.

...read moreread less

Abstract: GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Performance optimization for multi-core processors has been a challenge for programmers. Furthermore, optimizing for power consumption is even more difficult. Unfortunately, as a result of the high number of processors, the power consumption of many-core processors such as GPUs has increased significantly. Hence, in this paper, we propose an integrated power and performance (IPP) prediction model for a GPU architecture to predict the optimal number of active processors for a given application. The basic intuition is that when an application reaches the peak memory bandwidth, using more cores does not result in performance improvement. We develop an empirical power model for the GPU. Unlike most previous models, which require measured execution times, hardware performance counters, or architectural simulations, IPP predicts execution times to calculate dynamic power events. We then use the outcome of IPP to control the number of running cores. We also model the increases in power consumption that resulted from the increases in temperature. With the predicted optimal number of active cores, we show that we can save up to 22.09%of runtime GPU energy consumption and on average 10.99% of that for the five memory bandwidth-limited benchmarks.

...read moreread less

514 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200

Collapse