Author

Ting Cao

Other affiliations: Microsoft, Advanced Micro Devices, Chinese Academy of Sciences

Bio: Ting Cao is an academic researcher from Central South University. The author has contributed to research in topics: Computer science & Inference. The author has an h-index of 10 and has co-authored 31 publications receiving 504 citations. Previous affiliations of Ting Cao include Microsoft and Advanced Micro Devices.


Papers
Journal ArticleDOI
09 Jun 2012
TL;DR: On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power.
Abstract: On the hardware side, asymmetric multicore processors present software with the challenge and opportunity of optimizing in two dimensions: performance and power. Asymmetric multicore processors (AMP) combine general-purpose big (fast, high power) cores and small (slow, low power) cores to meet power constraints. Realizing their energy efficiency opportunity requires workloads with differentiated performance and power characteristics. On the software side, managed workloads written in languages such as C#, Java, JavaScript, and PHP are ubiquitous. Managed languages abstract over hardware using Virtual Machine (VM) services (garbage collection, interpretation, and/or just-in-time compilation) that together impose substantial energy and performance costs, ranging from 10% to over 80%. We show that these services manifest a differentiated performance and power workload. To differing degrees, they are parallel, asynchronous, communicate infrequently, and are not on the application's critical path. We identify a synergy between AMP and VM services that we exploit to attack the 40% average energy overhead due to VM services. Using measurements and very conservative models, we show that adding small cores tailored for VM services should deliver, at least, improvements in performance of 13%, energy of 7%, and performance per energy of 22%. The yin of VM services is overhead, but it meets the yang of small cores on an AMP. The yin of AMP is exposed hardware complexity, but it meets the yang of abstraction in managed languages. VM services fulfill the AMP requirement for an asynchronous, non-critical, differentiated, parallel, and ubiquitous workload to deliver energy efficiency. Generalizing this approach beyond system software to applications will require substantially more software and hardware investment, but these results show the potential energy efficiency gains are significant.
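The offloading argument above can be illustrated with a first-order arithmetic sketch: if VM services (GC, JIT) run asynchronously on a small core off the critical path, the application both finishes sooner and spends less energy. All parameter values below are illustrative assumptions, not the paper's measured numbers, and the model assumes perfect overlap (a best case, where the paper's models are deliberately conservative).

```python
# Hypothetical first-order model of AMP offload: the application runs on
# a big core while VM services run asynchronously on a small core.
# Parameters are invented for illustration.

def amp_savings(vm_overhead, small_core_power_ratio, big_core_power=1.0):
    """Return (time_saving, energy_saving) fractions from offloading
    VM services, assuming they fully overlap with application work.

    vm_overhead: fraction of execution time spent in VM services
    small_core_power_ratio: small-core power relative to the big core
    """
    # Offloading removes VM-service time from the critical path...
    new_time = 1.0 - vm_overhead
    # ...but the small core burns (cheaper) power for the offloaded work.
    baseline_energy = big_core_power * 1.0
    new_energy = (big_core_power * new_time
                  + big_core_power * small_core_power_ratio * vm_overhead)
    return 1.0 - new_time, 1.0 - new_energy / baseline_energy

# 40% VM overhead, small core at 30% of big-core power (both assumed).
time_saving, energy_saving = amp_savings(vm_overhead=0.4,
                                         small_core_power_ratio=0.3)
```

Under these assumed inputs the sketch yields a 40% time saving and a 28% energy saving; the paper's much smaller reported gains reflect its far more conservative assumptions about overlap and core behavior.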

104 citations

Proceedings ArticleDOI
05 Mar 2011
TL;DR: This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology, revealing the extent of some known and previously unobserved hardware and software trends.
Abstract: This paper reports and analyzes measured chip power and performance on five process technology generations executing 61 diverse benchmarks with a rigorous methodology. We measure representative Intel IA32 processors with technologies ranging from 130nm to 32nm while they execute sequential and parallel benchmarks written in native and managed languages. During this period, hardware and software changed substantially: (1) hardware vendors delivered chip multiprocessors instead of uniprocessors, and independently (2) software developers increasingly chose managed languages instead of native languages. This quantitative data reveals the extent of some known and previously unobserved hardware and software trends. Two themes emerge. (I) Workload: The power, performance, and energy trends of native workloads do not approximate managed workloads. For example, (a) the SPEC CPU2006 native benchmarks on the i7 (45 nm) and i5 (32 nm) draw significantly less power than managed or scalable native benchmarks; and (b) managed runtimes exploit parallelism even when running single-threaded applications. The results recommend architects always include native and managed workloads when designing and evaluating energy efficient hardware. (II) Architecture: Clock scaling, microarchitecture, simultaneous multithreading, and chip multiprocessors each elicit a huge variety of power, performance, and energy responses. This variety and the difficulty of obtaining power measurements recommend exposing on-chip power meters and, when possible, structure-specific power meters for cores, caches, and other structures. Just as hardware event counters provide a quantitative grounding for performance innovations, power meters are necessary for optimizing energy.
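The closing recommendation, that power meters are as necessary as event counters, comes down to a simple accounting step: energy is power integrated over time, so a faster but hungrier chip can still lose on energy. A minimal sketch of that integration, with made-up sample values:

```python
# Sketch: given power readings (watts) sampled at a fixed interval,
# integrate to energy (joules) -- the accounting that on-chip power
# meters would make routine. All sample values are invented.

def energy_joules(power_samples_w, interval_s):
    """Trapezoidal integration of sampled power over time."""
    if len(power_samples_w) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(power_samples_w, power_samples_w[1:]):
        total += (a + b) / 2.0 * interval_s
    return total

# Two hypothetical runs of the same work:
# faster but power-hungry vs. slower but frugal.
e_fast = energy_joules([40, 42, 41], 1.0)           # 3 s run
e_slow = energy_joules([20, 21, 20, 21, 20], 1.0)   # 5 s run
```

Here the slower run narrowly wins on energy (82.0 J vs. 82.5 J), the kind of trade-off that is invisible without power measurement.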

89 citations

Journal ArticleDOI
19 Aug 2016
TL;DR: This survey paper will give a high-level overview of the existing parallel data processing systems categorized by the data input as batch processing, stream processing, graph processing, and machine learning processing and introduce representative projects in each category.
Abstract: The volume, variety, and velocity properties of big data and the valuable information it contains have motivated the investigation of many new parallel data processing systems in addition to the approaches using traditional database management systems (DBMSs). MapReduce pioneered this paradigm change and rapidly became the primary big data processing system for its simplicity, scalability, and fine-grain fault tolerance. However, compared with DBMSs, MapReduce also arouses controversy in processing efficiency, low-level abstraction, and rigid dataflow. Inspired by MapReduce, nowadays the big data systems are blooming. Some of them follow MapReduce's idea, but with more flexible models for general-purpose usage. Some absorb the advantages of DBMSs with higher abstraction. There are also specific systems for certain applications, such as machine learning and stream data processing. To explore new research opportunities and assist users in selecting suitable processing systems for specific applications, this survey paper will give a high-level overview of the existing parallel data processing systems categorized by the data input as batch processing, stream processing, graph processing, and machine learning processing and introduce representative projects in each category. As the pioneer, the original MapReduce system, as well as its active variants and extensions on dataflow, data access, parameter tuning, communication, and energy optimizations will be discussed at first. System benchmarks and open issues for big data processing will also be studied in this survey.
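The MapReduce paradigm the survey starts from can be captured in a few lines: the user supplies map and reduce functions, and the framework handles grouping by key (the "shuffle"). The single-machine toy below illustrates only the programming model, none of the distribution, scalability, or fault tolerance that made the real system significant.

```python
# Minimal in-process sketch of the MapReduce programming model:
# user-supplied map and reduce functions, framework-managed shuffle.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group by key
    return {k: reducer(k, vs) for k, vs in groups.items()}  # reduce phase

# Canonical example: word count.
docs = ["big data systems", "big data"]
counts = map_reduce(docs,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda word, ones: sum(ones))
```

The rigid map-shuffle-reduce dataflow visible even in this toy is exactly the "rigid dataflow" the survey cites as a source of controversy, and what the more flexible successor systems relax.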

80 citations

Proceedings ArticleDOI
24 Jun 2021
TL;DR: nn-Meter, as discussed by the authors, predicts the inference latency of DNN models on diverse edge devices by dividing a whole model inference into kernels and conducting kernel-level prediction.
Abstract: With the recent trend of on-device deep learning, inference latency has become a crucial metric in running Deep Neural Network (DNN) models on various mobile and edge devices. To this end, latency prediction of DNN model inference is highly desirable for many tasks where measuring the latency on real devices is infeasible or too costly, such as searching for efficient DNN models with latency constraints from a huge model-design space. Yet it is very challenging and existing approaches fail to achieve a high accuracy of prediction, due to the varying model-inference latency caused by the runtime optimizations on diverse edge devices. In this paper, we propose and develop nn-Meter, a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices. The key idea of nn-Meter is dividing a whole model inference into kernels, i.e., the execution units on a device, and conducting kernel-level prediction. nn-Meter builds atop two key techniques: (i) kernel detection to automatically detect the execution unit of model inference via a set of well-designed test cases; and (ii) adaptive sampling to efficiently sample the most beneficial configurations from a large space to build accurate kernel-level latency predictors. Implemented on three popular platforms of edge hardware (mobile CPU, mobile GPU, and Intel VPU) and evaluated using a large dataset of 26,000 models, nn-Meter significantly outperforms the prior state-of-the-art.
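The core idea, predicting whole-model latency as the sum of kernel-level predictions, can be sketched as follows. The kernel names, latency formulas, and configurations below are invented for illustration; in the real system, kernels are discovered automatically via test cases and the per-kernel predictors are trained from adaptively sampled measurements rather than hand-written.

```python
# Sketch of nn-Meter's kernel-level prediction: a model is a sequence
# of fused kernels, and the model-level latency estimate is the sum of
# per-kernel predictions. All predictors and numbers are hypothetical.

# Hypothetical per-kernel latency predictors (ms), keyed by kernel name.
KERNEL_PREDICTORS = {
    "conv-bn-relu":   lambda cfg: 0.02 * cfg["cin"] * cfg["cout"] / 1000,
    "dwconv-bn-relu": lambda cfg: 0.05 * cfg["cin"] / 10,
    "fc":             lambda cfg: 0.01 * cfg["cin"] * cfg["cout"] / 1000,
}

def predict_latency(kernels):
    """Sum kernel-level predictions into a model-level estimate (ms)."""
    return sum(KERNEL_PREDICTORS[name](cfg) for name, cfg in kernels)

# A toy three-kernel model.
model = [("conv-bn-relu",   {"cin": 3,  "cout": 32}),
         ("dwconv-bn-relu", {"cin": 32}),
         ("fc",             {"cin": 32, "cout": 10})]
latency_ms = predict_latency(model)
```

The decomposition matters because runtimes fuse operators into kernels differently per device, so predicting at the kernel level (rather than per operator or per whole model) is what lets the same approach transfer across mobile CPU, mobile GPU, and VPU backends.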

76 citations

Journal ArticleDOI
TL;DR: A Writeback-Aware Dynamic CachE (WADE) management technique to help mitigate the write overhead in NVM-based memory that tries to keep highly reused dirty cache blocks in the Last-Level Cache.
Abstract: Emerging Non-Volatile Memory (NVM) technologies are explored as potential alternatives to traditional SRAM/DRAM-based memory architecture in future microprocessor design. One of the major disadvantages for NVM is the latency and energy overhead associated with write operations. Mitigation techniques to minimize the write overhead for NVM-based main memory architecture have been studied extensively. However, most prior work focuses on optimization techniques for NVM-based main memory itself, with little attention paid to cache management policies for the Last-Level Cache (LLC). In this article, we propose a Writeback-Aware Dynamic CachE (WADE) management technique to help mitigate the write overhead in NVM-based memory. The proposal is based on the observation that, when dirty cache blocks are evicted from the LLC and written into NVM-based memory (with PCM as an example), the long latency and high energy associated with write operations to NVM-based memory can cause system performance/power degradation. Thus, reducing the number of writeback requests from the LLC is critical. The proposed WADE cache management technique tries to keep highly reused dirty cache blocks in the LLC. The technique predicts blocks that are frequently written back in the LLC. The LLC sets are dynamically partitioned into a frequent writeback list and a nonfrequent writeback list. It maintains the best size for each list in the LLC. Our evaluation shows that the technique can reduce the number of writeback requests by 16.5% for memory-intensive single-threaded benchmarks and 10.8% for multicore workloads. It yields a geometric mean speedup of 5.1% for single-thread applications and 7.6% for multicore workloads. Due to the reduced number of writeback requests to main memory, the technique reduces the energy consumption by 8.1% for single-thread applications and 7.6% for multicore workloads.
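The prediction step at the heart of WADE, identifying blocks that are frequently written back so they can be kept in the LLC, can be sketched as a simple per-block counter with a threshold. This toy omits the set partitioning, dynamic list sizing, and replacement-policy integration described in the abstract, and the threshold is an invented value.

```python
# Toy sketch of WADE's prediction idea: count writebacks per block and
# flag frequently-written-back blocks, which the cache would then prefer
# to keep resident so fewer expensive NVM writes occur. The threshold
# and the counter-based predictor are illustrative assumptions.
from collections import Counter

class WritebackPredictor:
    def __init__(self, threshold=2):
        self.writebacks = Counter()   # per-block writeback counts
        self.threshold = threshold

    def record_writeback(self, block):
        """Called whenever a dirty block is evicted and written back."""
        self.writebacks[block] += 1

    def is_frequent(self, block):
        """Should the cache try to keep this dirty block resident?"""
        return self.writebacks[block] >= self.threshold

p = WritebackPredictor()
for blk in ["A", "B", "A"]:
    p.record_writeback(blk)
# Block "A" has been written back twice and is flagged; "B" is not.
```

Keeping flagged blocks like "A" in the LLC trades a small amount of cache capacity for a reduction in writes to slow, energy-hungry NVM, the trade the evaluation quantifies.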

51 citations


Cited by
Journal ArticleDOI
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.
Abstract: A key question for the microprocessor research and design community is whether scaling multicores will provide the performance and value needed to scale down many more technology generations. To provide a quantitative answer to this question, a comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity-dark silicon-is timely and crucial.

1,556 citations

Proceedings ArticleDOI
04 Jun 2011
TL;DR: The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community.
Abstract: Since 2005, processor designers have increased core counts to exploit Moore's Law scaling, rather than focusing on single-core performance. The failure of Dennard scaling, to which the shift to multicore parts is partially a response, may soon limit multicore scaling just as single-core scaling has been curtailed. This paper models multicore scaling limits by combining device scaling, single-core scaling, and multicore scaling to measure the speedup potential for a set of parallel workloads for the next five technology generations. For device scaling, we use both the ITRS projections and a set of more conservative device scaling parameters. To model single-core scaling, we combine measurements from over 150 processors to derive Pareto-optimal frontiers for area/performance and power/performance. Finally, to model multicore scaling, we build a detailed performance model of upper-bound performance and lower-bound core power. The multicore designs we study include single-threaded CPU-like and massively threaded GPU-like multicore chip organizations with symmetric, asymmetric, dynamic, and composed topologies. The study shows that regardless of chip organization and topology, multicore scaling is power limited to a degree not widely appreciated by the computing community. Even at 22 nm (just one year from now), 21% of a fixed-size chip must be powered off, and at 8 nm, this number grows to more than 50%. Through 2024, only 7.9x average speedup is possible across commonly used parallel workloads, leaving a nearly 24-fold gap from a target of doubled performance per generation.
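The "powered off" percentages follow from straightforward arithmetic: area budgets let far more cores fit on a chip than the power budget can light up at once. A back-of-envelope sketch with invented numbers (not the paper's detailed device and core models):

```python
# Back-of-envelope illustration of power-limited ("dark silicon")
# scaling: a fixed chip power budget can light only some of the cores
# that fit in the area budget. All numbers below are invented.

def dark_fraction(cores_that_fit, watts_per_core, power_budget_w):
    """Fraction of cores that must stay dark under the power budget."""
    powered = min(cores_that_fit, int(power_budget_w // watts_per_core))
    return 1.0 - powered / cores_that_fit

# Hypothetically: 64 cores fit in the area budget, each draws 2.5 W,
# but the chip's power budget is only 100 W -> 40 cores lit, 24 dark.
frac = dark_fraction(64, 2.5, 100.0)
```

With these assumed inputs, 37.5% of the chip is dark; the paper reaches its 21% (22 nm) and >50% (8 nm) figures from measured Pareto frontiers rather than flat per-core numbers, but the power-versus-area mismatch driving the result is the same.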

1,379 citations

Book
29 Sep 2011
TL;DR: The Fifth Edition of Computer Architecture focuses on this dramatic shift in the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.
Abstract: The computing world today is in the middle of a revolution: mobile clients and cloud computing have emerged as the dominant paradigms driving programming and hardware innovation today. The Fifth Edition of Computer Architecture focuses on this dramatic shift, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices. Each chapter includes two real-world examples, one mobile and one datacenter, to illustrate this revolutionary change. This edition:
- Is updated to cover the mobile computing revolution.
- Emphasizes the two most important topics in architecture today: memory hierarchy and parallelism in all its forms.
- Develops common themes throughout each chapter: power, performance, cost, dependability, protection, programming models, and emerging trends ("What's Next").
- Includes three review appendices in the printed text; additional reference appendices are available online.
- Includes updated case studies and completely new exercises.

984 citations

Journal ArticleDOI
TL;DR: This paper proposes a framework that brings together a large number of previous studies on smart cities and sustainable cities, including research directed at a more conceptual, analytical, and overarching level as well as research on specific technologies and their novel applications, to add depth to studies in the field of smart sustainable cities.

436 citations

Journal ArticleDOI
TL;DR: A framework for structural health monitoring (SHM) using IoT technologies on intelligent and reliable monitoring is introduced and technologies involved in IoT and SHM system implementation as well as data routing strategy in IoT environment are presented.
Abstract: The Internet of Things (IoT) has recently received great attention due to its potential and capacity to be integrated into any complex system. As a result of the rapid development of sensing technologies such as radio-frequency identification and sensors, and the convergence of information technologies such as wireless communication and the Internet, IoT is emerging as an important technology for monitoring systems. This paper reviews and introduces a framework for structural health monitoring (SHM) using IoT technologies for intelligent and reliable monitoring. Specifically, technologies involved in IoT and SHM system implementation, as well as data routing strategy in the IoT environment, are presented. As the data generated by sensing devices are voluminous and arrive faster than ever, big data solutions are introduced to deal with the complex and large amounts of data collected from sensors installed on structures.

319 citations