Home
/
Authors
/
Ayose Falcón

Author

Ayose Falcón

Other affiliations: Hewlett-Packard, Polytechnic University of Catalonia, Polytechnic University of Puerto Rico

Bio: Ayose Falcón is an academic researcher from Intel. The author has contributed to research in topics: Branch predictor & Thread (computing). The author has an hindex of 13, co-authored 37 publications receiving 772 citations. Previous affiliations of Ayose Falcón include Hewlett-Packard & Polytechnic University of Catalonia.

Topics: Branch predictor, Thread (computing), Cache, Convolution, Instructions per cycle ...read more

Papers

PDF

Open Access

More filters

Journal Article•DOI•

COTSon: infrastructure for full system simulation

[...]

Eduardo Argollo¹, Ayose Falcón¹, Paolo Faraboschi¹, Matteo Monchiero¹, Daniel Ortega¹ - Show less +1 more•Institutions (1)

Hewlett-Packard¹

01 Jan 2009-Operating Systems Review

TL;DR: COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with a minimal accuracy loss, and abandon the idea of "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed.

...read moreread less

Abstract: Simulation has historically been the primary technique used for evaluating the performance of new proposals in computer architecture. Speed and complexity considerations have traditionally limited its applicability to single-thread processors running application-level code. This is no longer sufficient to model modern multicore systems running the complex workloads of commercial interest today.COTSon is a simulator framework jointly developed by HP Labs and AMD. The goal of COTSon is to provide fast and accurate evaluation of current and future computing systems, covering the full software stack and complete hardware models. It targets cluster-level systems composed of hundreds of commodity multicore nodes and their associated devices connected through a standard communication network. COTSon adopts a functional-directed philosophy, where fast functional emulators and timing models cooperate to improve the simulation accuracy at a speed sufficient to simulate the full stack of applications, middleware and OSs.This paper describes the changes in simulation philosophy we embraced in COTSon to address these new challenges. We base functional emulation on established, fast and validated tools that support commodity OSs and complex multitier applications. Through a robust interface between the functional and timing domain, we can leverage other existing simulators for individual sub-components, such as disks or networks. We abandon the idea of "always-on" cycle-based simulation in favor of statistical sampling approaches that can trade accuracy for speed.COTSon opens up a new dimension in the speed/accuracy space, allowing simulation of a cluster of nodes several orders of magnitude faster with a minimal accuracy loss.

...read moreread less

206 citations

Journal Article•DOI•

How to simulate 1000 cores

[...]

Matteo Monchiero¹, Jung Ho Ahn¹, Ayose Falcón¹, Daniel Ortega¹, Paolo Faraboschi¹ - Show less +1 more•Institutions (1)

Hewlett-Packard¹

23 Jul 2009-ACM Sigarch Computer Architecture News

TL;DR: A novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores, which captures the intrinsic behavior of the SPLASH-2 suite, even when the number of shared- memory cores beyond the thousand-core limit is scaled up.

...read moreread less

Abstract: This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into corelevel parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on a many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated the scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.

...read moreread less

70 citations

Patent•

Weight-shifting mechanism for convolutional neural networks

[...]

Ayose Falcón¹, Marc Lupon¹, Herrero Abellanas Enric¹, Pedro Lopez¹, Fernando Latorre¹, Frederico Pratas¹, Georgios Tournavitis¹ - Show less +3 more•Institutions (1)

Intel¹

22 Jul 2014

TL;DR: A processor includes a processor core and a calculation circuit as discussed by the authors, which includes logic to determine a set of weights for use in a convolutional neural network (CNN) calculation and scale up the weights using a scale value.

...read moreread less

Abstract: A processor includes a processor core and a calculation circuit. The processor core includes logic determine a set of weights for use in a convolutional neural network (CNN) calculation and scale up the weights using a scale value. The calculation circuit includes logic to receive the scale value, the set of weights, and a set of input values, wherein each input value and associated weight of a same fixed size. The calculation circuit also includes logic to determine results from convolutional neural network (CNN) calculations based upon the set of weights applied to the set of input values, scale down the results using the scale value, truncate the scaled down results to the fixed size, and communicatively couple the truncated results to an output for a layer of the CNN.

...read moreread less

58 citations

Patent•

Storage device and method for performing convolution operations

[...]

Herrero Abellanas Enric¹, Georgios Tournavitis¹, Frederico Pratas¹, Marc Lupon¹, Fernando Latorre¹, Pedro Lopez¹, Ayose Falcón¹ - Show less +3 more•Institutions (1)

Intel¹

22 Sep 2015

TL;DR: In this paper, a storage device and method for performing convolution operations is described, which comprises a plurality of processing units to execute convolution operation on input data and partial results.

...read moreread less

Abstract: A storage device and method are described for performing convolution operations. For example, one embodiment of an apparatus to perform convolution operations comprises a plurality of processing units to execute convolution operations on input data and partial results; a unified scratchpad memory comprising a plurality of memory banks communicatively coupled to the plurality of processing units through a plurality of read/write ports, each of the plurality of memory banks partitioned to store both the input data and partial results; a control unit to allocate the input data and partial results to the memory banks to ensure a minimum quality of service in accordance with the specified number of read/write ports and the specified convolution operation to be performed.

...read moreread less

53 citations

Patent•

Method and apparatus for distributed and cooperative computation in artificial neural networks

[...]

Frederico Pratas¹, Ayose Falcón¹, Marc Lupon¹, Fernando Latorre¹, Pedro Lopez¹, Enric Herrero Abellanas¹, Georgios Tournavitis¹ - Show less +3 more•Institutions (1)

Intel¹

19 Nov 2015

TL;DR: In this paper, an apparatus and method for distributed and cooperative computation in artificial neural networks is described, which comprises an input/output (I/O) interface, a plurality of processing units communicatively coupled to the I/O interface to receive data for input neurons and synaptic weights associated with each of the input neurons, each unit processing at least a portion of the data for the inputs and weights to generate partial results.

...read moreread less

Abstract: An apparatus and method are described for distributed and cooperative computation in artificial neural networks. For example, one embodiment of an apparatus comprises: an input/output (I/O) interface; a plurality of processing units communicatively coupled to the I/O interface to receive data for input neurons and synaptic weights associated with each of the input neurons, each of the plurality of processing units to process at least a portion of the data for the input neurons and synaptic weights to generate partial results; and an interconnect communicatively coupling the plurality of processing units, each of the processing units to share the partial results with one or more other processing units over the interconnect, the other processing units using the partial results to generate additional partial results or final results. The processing units may share data including input neurons and weights over the shared input bus.

...read moreread less

52 citations

1
2
3
4
…
5
6
7
8

Collapse

Cited by

PDF

Open Access

More filters

Journal Article•DOI•

Corona: System Implications of Emerging Nanophotonic Technology

[...]

Dana M. Vantrease¹, Robert Schreiber², Matteo Monchiero², Moray McLaren², Norman P. Jouppi², Marco Fiorentino², Al Davis³, Nathan Binkert², Raymond G. Beausoleil², Jung Ho Ahn² - Show less +6 more•Institutions (3)

University of Wisconsin-Madison¹, Hewlett-Packard², University of Utah³

01 Jun 2008

TL;DR: This work believes that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory intensive workloads, while simultaneously reducing power.

...read moreread less

Abstract: We expect that many-core microprocessors will push performance per chip from the 10 gigaflop to the 10 teraflop range in the coming decade. To support this increased performance, memory and inter-core bandwidths will also have to scale by orders of magnitude. Pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet these off- and on-stack bandwidth requirements at acceptable power levels. Corona is a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Its peak floating-point performance is 10 teraflops. Dense wavelength division multiplexed optically connected memory modules provide 10 terabyte per second memory bandwidth. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabyte per second bandwidth. We have simulated a 1024 thread Corona system running synthetic benchmarks and scaled versions of the SPLASH-2 benchmark suite. We believe that in comparison with an electrically-connected many-core alternative that uses the same on-stack interconnect power, Corona can provide 2 to 6 times more performance on many memory intensive workloads, while simultaneously reducing power.

...read moreread less

688 citations

Proceedings Article•DOI•

Graphite: A distributed parallel simulator for multicores

[...]

Jason E. Miller¹, Harshad Kasture¹, George Kurian¹, Charles Gruenwald¹, Nathan Beckmann¹, Christopher Celio¹, Jonathan Eastep¹, Anant Agarwal¹ - Show less +4 more•Institutions (1)

Massachusetts Institute of Technology¹

01 Apr 2010

TL;DR: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure and demonstrates that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers with near linear speedup.

...read moreread less

Abstract: This paper introduces the Graphite open-source distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multi-core processors containing dozens, hundreds, or even thousands of cores. It provides high performance for fast design space exploration and software development. Several techniques are used to achieve this including: direct execution, seamless multicore and multi-machine distribution, and lax synchronization. Graphite is capable of accelerating simulations by distributing them across multiple commodity Linux machines. When using multiple machines, it provides the illusion of a single process with a single, shared address space, allowing it to run off-the-shelf pthread applications with no source code modification. Our results demonstrate that Graphite can simulate target architectures containing over 1000 cores on ten 8-core servers. Performance scales well as more machines are added with near linear speedup in many cases. Simulation slowdown is as low as 41× versus native execution.

...read moreread less

498 citations

Proceedings Article•DOI•

Disaggregated memory for expansion and sharing in blade servers

[...]

Kevin Lim¹, Jichuan Chang², Trevor Mudge¹, Parthasarathy Ranganathan², Steven K. Reinhardt³, Thomas F. Wenisch¹ - Show less +2 more•Institutions (3)

University of Michigan¹, Hewlett-Packard², Advanced Micro Devices³

20 Jun 2009

TL;DR: It is demonstrated that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by the solutions can improve performance-per-dollar by up to 57% when optimizing memory provisioning across multiple servers.

...read moreread less

Abstract: Analysis of technology and application trends reveals a growing imbalance in the peak compute-to-memory-capacity ratio for future servers. At the same time, the fraction contributed by memory systems to total datacenter costs and power consumption during typical usage is increasing. In response to these trends, this paper re-examines traditional compute-memory co-location on a single system and details the design of a new general-purpose architectural building block-a memory blade-that allows memory to be "disaggregated" across a system ensemble. This remote memory blade can be used for memory capacity expansion to improve performance and for sharing memory across servers to reduce provisioning and power costs. We use this memory blade building block to propose two new system architecture solutions-(1) page-swapped remote memory at the virtualization layer, and (2) block-access remote memory with support in the coherence hardware-that enable transparent memory expansion and sharing on commodity-based systems. Using simulations of a mix of enterprise benchmarks supplemented with traces from live datacenters, we demonstrate that memory disaggregation can provide substantial performance benefits (on average 10X) in memory constrained environments, while the sharing enabled by our solutions can improve performance-per-dollar by up to 57% when optimizing memory provisioning across multiple servers.

...read moreread less

423 citations

Journal Article•DOI•

A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies

[...]

Shyamkumar Thoziyoor¹, Jung Ho Ahn², Matteo Monchiero², Jay B. Brockman¹, Norman P. Jouppi² - Show less +1 more•Institutions (2)

University of Notre Dame¹, Hewlett-Packard²

01 Jun 2008

TL;DR: This study uses CACTI-D to model all components of the memory hierarchy including L1, L2, last level SRAM, logic process based DRAM or commodity DRAM L3 caches, and main memory DRAM chips and finds that commodity DRam technology is most attractive for stacked last level caches, with significantly lower energy-delay products.

...read moreread less

Abstract: In this paper we introduce CACTI-D, a significant enhancement of CACTI 5.0. CACTI-D adds support for modeling of commodity DRAM technology and support for main memory DRAM chip organization. CACTI-D enables modeling of the complete memory hierarchy with consistent models all the way from SRAM based L1 caches through main memory DRAMs on DIMMs. We illustrate the potential applicability of CACTI-D in the design and analysis of future memory hierarchies by carrying out a last level cache study for a multicore multithreaded architecture at the 32nm technology node. In this study we use CACTI-D to model all components of the memory hierarchy including L1, L2, last level SRAM, logic process based DRAM or commodity DRAM L3 caches, and main memory DRAM chips. We carry out architectural simulation using benchmarks with large data sets and present results of their execution time, breakdown of power in the memory hierarchy, and system energy-delay product for the different system configurations. We find that commodity DRAM technology is most attractive for stacked last level caches, with significantly lower energy-delay products.

...read moreread less

249 citations

Operating system support for NVM+DRAM hybrid main memory

[...]

Jeffrey C. Mogul¹, Eduardo Argollo¹, Mehul A. Shah¹, Paolo Faraboschi¹•Institutions (1)

Hewlett-Packard¹

18 May 2009

TL;DR: Preliminary experiments suggesting that this approach to building main memory as a hybrid between DRAM and non-volatile memory, such as flash or PC-RAM, is viable are described.

...read moreread less

Abstract: Technology trends may soon favor building main memory as a hybrid between DRAM and non-volatile memory, such as flash or PC-RAM. We describe how the operating system might manage such hybrid memories, using semantic information not available in other layers. We describe preliminary experiments suggesting that this approach is viable.

...read moreread less

248 citations

1
2
3
4
…
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139

Collapse