Author

Jan Gray

Other affiliations: Intel, University of Waterloo
Bio: Jan Gray is an academic researcher from Microsoft. He has contributed to research on topics including Block (telecommunications) and Transactional memory. He has an h-index of 28 and has co-authored 75 publications receiving 3,685 citations. Previous affiliations of Jan Gray include Intel and the University of Waterloo.


Papers
Journal ArticleDOI
TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.
Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution, or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.

835 citations
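The closing claim follows from simple arithmetic on the stated numbers: if each accelerated server sustains 1.95x the baseline ranking throughput, the server count needed for a fixed aggregate load shrinks accordingly (a back-of-envelope check, not a figure quoted in the paper):

\[
N_{\text{accelerated}} = \frac{N_{\text{baseline}}}{1 + 0.95} \approx 0.51\, N_{\text{baseline}},
\]

i.e. roughly half the servers for the same throughput.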

Journal ArticleDOI
14 Jun 2014
TL;DR: The requirements and architecture of the fabric are described, the critical engineering challenges and solutions needed to make the system robust in the presence of failures are detailed, and the performance, power, and resilience of the system when ranking candidate documents are measured.
Abstract: Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% for a fixed latency distribution, or, while maintaining equivalent throughput, reduces the tail latency by 29%.

712 citations
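A minimal Python sketch of the 6x8 2-D torus wiring described in the abstract, assuming FPGAs are addressed by (row, column) within the half-rack; the coordinate scheme and function names are illustrative, not taken from the paper.

```python
ROWS, COLS = 6, 8  # 6x8 torus of FPGAs in a 48-server half-rack

def torus_neighbors(row, col):
    """Return the four directly wired neighbors of the FPGA at (row, col).

    Wrap-around indexing models the torus: the last row/column links back
    to the first, so every FPGA has exactly four point-to-point links.
    """
    return [
        ((row - 1) % ROWS, col),  # north
        ((row + 1) % ROWS, col),  # south
        (row, (col - 1) % COLS),  # west
        (row, (col + 1) % COLS),  # east
    ]

# Example: the corner FPGA (0, 0) wraps around to (5, 0) and (0, 7).
print(torus_neighbors(0, 0))
```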

Patent
26 May 2004
TL;DR: In this article, a run-time executive of an object management system for managing execution of software components in an object execution environment uses a component context object to store intrinsic context properties related to an associated component.
Abstract: A run-time executive of an object management system for managing execution of software components in an object execution environment uses a component context object to store intrinsic context properties related to an associated component. The run-time executive maintains an implicit association of the component context object with the application component. For example, the context properties can include a client id, an activity id, and a transaction reference. The component context object also provides an interface accessible to the associated component, with member functions for use in transaction processing, in creating additional application components that inherit the component's context properties, and in access control based on abstract user classes (roles).

193 citations
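A hedged Python sketch of the idea: a context object implicitly associated with a component, holding a client id, activity id, and transaction reference, and exposing helpers for transaction voting, context-inheriting creation, and role checks. All names below are illustrative; the patent does not prescribe this API.

```python
class ComponentContext:
    """Intrinsic context the run-time executive keeps for one component."""

    def __init__(self, client_id, activity_id, transaction=None):
        self.client_id = client_id
        self.activity_id = activity_id
        self.transaction = transaction
        self._happy = True  # component's vote on the transaction outcome

    def set_complete(self):
        """Vote to commit the enclosing transaction (if any)."""
        self._happy = True

    def set_abort(self):
        """Vote to abort the enclosing transaction."""
        self._happy = False

    def create_instance(self, component_cls):
        """Create another component that inherits this context's properties."""
        instance = component_cls()
        instance.context = ComponentContext(
            self.client_id, self.activity_id, self.transaction)
        return instance

    def is_caller_in_role(self, role, caller_roles):
        """Access check against an abstract user class (role)."""
        return role in caller_roles
```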

Patent
17 Aug 1998
TL;DR: Intelligent trust management as discussed by the authors provides a centralized security facility that gives system components a flexible mechanism for implementing security policies; policy objects include executable code that uses request arguments along with dynamically obtained variable information to make a decision.
Abstract: Intelligent Trust Management provides a centralized security facility that gives system components a flexible mechanism for implementing security policies. System components such as applications create a request describing an action that needs to be checked against an appropriate security policy. The request is given to a trust system that determines which policy object applies to the request, and may pass request arguments to the policy. The policy objects include executable code that uses any arguments along with dynamically obtained variable information to make a decision. The decision is returned to the system component, which then operates accordingly. Policy objects may maintain state and interface with the user independent of the system component in order to obtain information to make their decisions. Policy objects may call other policy objects and/or mathematically combine the results of other policy objects to make a decision.

179 citations
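A small Python sketch of the flow the abstract describes: a trust system routes a request to the policy object that governs it, and a policy object is executable code that may consult sub-policies and combine their results. Class names, the example policies, and the combination rule are illustrative, not from the patent.

```python
class PolicyObject:
    """Executable policy: decides a request, possibly using sub-policies."""

    def __init__(self, decide, sub_policies=()):
        self._decide = decide
        self.sub_policies = list(sub_policies)

    def evaluate(self, request, environment):
        sub_results = [p.evaluate(request, environment) for p in self.sub_policies]
        return self._decide(request, environment, sub_results)


class TrustSystem:
    """Maps each kind of action to the policy object that applies to it."""

    def __init__(self):
        self._policies = {}

    def register(self, action, policy):
        self._policies[action] = policy

    def check(self, action, request, environment):
        return self._policies[action].evaluate(request, environment)


# Example: allow a download if either the publisher is trusted or the
# user explicitly approves (illustrative policies only).
trusted = PolicyObject(lambda req, env, _: req.get("publisher") in env["trusted_publishers"])
user_ok = PolicyObject(lambda req, env, _: env.get("user_approved", False))
either = PolicyObject(lambda req, env, subs: any(subs), [trusted, user_ok])

trust = TrustSystem()
trust.register("download", either)
print(trust.check("download", {"publisher": "Contoso"}, {"trusted_publishers": {"Contoso"}}))
```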

Patent
17 Aug 1998
TL;DR: In this paper, an object system provides composable object execution environment extensions with an object model that defines a framework with contexts, policies, policy makers and activators that act as object creation-time, reference creation-time and call-time event sinks to provide processing of effects specific to the environment extensions.
Abstract: An object system provides composable object execution environment extensions with an object model that defines a framework with contexts, policies, policy makers and activators that act as object creation-time, reference creation-time and call-time event sinks to provide processing of effects specific to the environment extensions. At object creation time, an object instantiation service of the object system delegates to the activators to establish a context in which the object is created. The context contains context properties that represent the particular composable environment extensions in which the object is to execute. The context properties also can act as policy makers that contribute policies to an optimized policy set for references that cross context boundaries. The policies in such an optimized set are issued policy events on calls across the context boundary, so that they can process the effects of switching between the environment extensions of the two contexts.

164 citations
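A rough Python sketch of the call-time part of this machinery, assuming a cross-context reference carries the optimized policy set assembled by the context properties; all class and field names are illustrative, not from the patent.

```python
class ContextProperty:
    """Represents one environment extension; may act as a policy maker."""

    def __init__(self, name):
        self.name = name

    def deliver_policy(self):
        """Contribute a policy for references that cross out of this context."""
        return Policy(self.name)


class Policy:
    """Receives a policy event when a call crosses a context boundary."""

    def __init__(self, name):
        self.name = name

    def on_call(self, from_ctx, to_ctx):
        # Process the effect of switching environment extensions here.
        print(f"{self.name}: switching from {from_ctx} to {to_ctx}")


def call_across_boundary(policies, from_ctx, to_ctx, method, *args):
    """Issue call-time policy events, then forward the call into the target context."""
    for policy in policies:
        policy.on_call(from_ctx, to_ctx)
    return method(*args)


# Example: a transaction-like extension contributes one policy to the set.
policy_set = [ContextProperty("transaction").deliver_policy()]
print(call_across_boundary(policy_set, "client context", "server context", lambda x: x + 1, 41))
```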


Cited by
Posted Content
TL;DR: This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

3,067 citations
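The quoted 92 TOPS peak is consistent with the MAC count: each of the 65,536 MACs performs a multiply and an add per cycle, and at the TPU's roughly 700 MHz clock (a figure from the full paper, not stated in this abstract):

\[
65{,}536 \times 2\ \text{ops} \times 0.7\ \text{GHz} \approx 92\ \text{TOPS}.
\]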

Proceedings ArticleDOI
24 Jun 2017
TL;DR: The Tensor Processing Unit (TPU) as discussed by the authors is a custom ASIC deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN) using a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS).
Abstract: Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X -- 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X -- 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

2,679 citations
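A toy Python sketch of the 8-bit multiply-accumulate operation the matrix unit performs, with Python's wide integers standing in for the hardware's 32-bit accumulators; it illustrates the arithmetic only, not the systolic organization of the real unit.

```python
def mac_matrix_multiply(weights, activations):
    """8-bit multiply-accumulate, the core operation of a MAC matrix unit.

    weights: list of rows of int8 values; activations: list of int8 values.
    Each output element is the sum of per-element products (one MAC per pair).
    """
    outputs = []
    for row in weights:
        acc = 0
        for w, a in zip(row, activations):
            acc += w * a  # one 8-bit MAC
        outputs.append(acc)
    return outputs

# Example: a 2x3 weight matrix applied to a 3-element activation vector.
print(mac_matrix_multiply([[1, -2, 3], [4, 5, -6]], [10, 20, 30]))
```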

Journal ArticleDOI
18 Jun 2016
TL;DR: This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner.
Abstract: A number of recent efforts have attempted to design accelerators for popular machine learning algorithms, such as those involving convolutional and deep neural networks (CNNs and DNNs). These algorithms typically involve a large number of multiply-accumulate (dot-product) operations. A recent project, DaDianNao, adopts a near data processing approach, where a specialized neural functional unit performs all the digital arithmetic operations and receives input weights from adjacent eDRAM banks. This work explores an in-situ processing approach, where memristor crossbar arrays not only store input weights, but are also used to perform dot-product operations in an analog manner. While the use of crossbar memory as an analog dot-product engine is well known, no prior work has designed or characterized a full-fledged accelerator based on crossbars. In particular, our work makes the following contributions: (i) We design a pipelined architecture, with some crossbars dedicated for each neural network layer, and eDRAM buffers that aggregate data between pipeline stages. (ii) We define new data encoding techniques that are amenable to analog computations and that can reduce the high overheads of analog-to-digital conversion (ADC). (iii) We define the many supporting digital components required in an analog CNN accelerator and carry out a design space exploration to identify the best balance of memristor storage/compute, ADCs, and eDRAM storage on a chip. On a suite of CNN and DNN workloads, the proposed ISAAC architecture yields improvements of 14.8×, 5.5×, and 7.5× in throughput, energy, and computational density (respectively), relative to the state-of-the-art DaDianNao architecture.

1,558 citations
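A small Python model of the in-situ dot product the abstract describes: weights are stored as cell conductances, inputs are applied as row voltages, and by Ohm's and Kirchhoff's laws each column current is a dot product. Variable names are illustrative.

```python
def crossbar_dot_products(voltages, conductances):
    """Model of the analog dot product a memristor crossbar computes.

    Each input voltage V_i drives row i; each cell stores a weight as a
    conductance G[i][j].  The current collected at column j is
    sum_i V_i * G[i][j], i.e. one dot product per column, computed in
    place where the weights are stored.
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for v_i, row in zip(voltages, conductances):
        for j, g_ij in enumerate(row):
            currents[j] += v_i * g_ij
    return currents

# Example: 2 input rows driving 3 output columns.
print(crossbar_dot_products([1.0, 0.5], [[0.2, 0.4, 0.6], [0.8, 0.1, 0.0]]))
```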

Patent
17 Jun 2005
TL;DR: In this paper, the authors present a data processing system having a business object model reflecting the data used during a business transaction; consistent interfaces generated from the model are suitable for use across industries, across businesses, and across different departments within a business.
Abstract: Methods and systems consistent with the present invention provide a data processing system having a business object model reflecting the data used during a business transaction. Consistent interfaces are generated from the business object model. These interfaces are suitable for use across industries, across businesses, and across different departments within a business during a business transaction.

1,431 citations

Patent
30 Jul 1999
TL;DR: In this paper, a plurality of logical business components in a business are first defined, with each business component having a plurality of capabilities, and functional interrelationships are identified between the logical business components.
Abstract: A method of generating software based on business components. A plurality of logical business components in a business are first defined with each business component having a plurality of capabilities. Next, functional interrelationships are identified between the logical business components. Code modules are then generated to carry out the capabilities of the logical business components and the functional interrelationships between the logical business components, wherein the code modules represent a transformation of the logical business components to their physical implementation, while ensuring the capabilities that are carried out by each code module are essentially unique to the logical business component associated with the code module. Next, the functional aspects of the code modules and the functional relationships of the code modules are tested. The code modules are then subsequently deployed in an e-commerce environment.

950 citations
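A hedged Python sketch of the workflow in the abstract: define logical business components and their capabilities, record their functional interrelationships, and generate one code module per component whose functions are unique to it. The data structure and emitted stubs are illustrative, not from the patent.

```python
from dataclasses import dataclass, field

@dataclass
class BusinessComponent:
    """A logical business component and the capabilities it owns."""
    name: str
    capabilities: list
    depends_on: list = field(default_factory=list)  # functional interrelationships

def generate_module(component):
    """Emit a stub code module whose functions are unique to this component."""
    lines = [f"# module for {component.name}; uses: {', '.join(component.depends_on) or 'none'}"]
    for capability in component.capabilities:
        lines.append(f"def {capability}():")
        lines.append(f"    raise NotImplementedError('{component.name}.{capability}')")
    return "\n".join(lines)

# Example: one component with two capabilities and one dependency.
orders = BusinessComponent("orders", ["create_order", "cancel_order"], depends_on=["billing"])
print(generate_module(orders))
```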