Author

Doug Joseph

Bio: Doug Joseph is an academic researcher from IBM. The author has contributed to research in topics: Cache & Markov chain. The author has an h-index of 4, co-authored 4 publications receiving 731 citations.

Papers
Proceedings ArticleDOI
01 May 1997
TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
Abstract: Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate and produces timely results that can be effectively used by the processor. In our cycle-level simulations, the Markov Prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two-thirds the memory of a demand-fetch cache organization.

567 citations

Journal ArticleDOI
TL;DR: The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks while using only two-thirds the memory of a demand-fetch cache organization.
Abstract: Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be effectively used by the processor. We also explored a range of techniques that can be used to reduce the bandwidth demands of prefetching, leading to improved memory system performance. In our cycle-level simulations, the Markov Prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54 percent for various commercial benchmarks while only using two-thirds the memory of a demand-fetch cache organization.

132 citations
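
The two abstracts above describe the mechanism only at a high level; the sketch below is one illustrative way to realize a prediction table that maps a missing address to several previously observed successors and delivers them in priority order. The table sizes, the per-entry prediction count, and the count-based priority rule are assumptions made for this example, not details taken from the paper.

```python
# Illustrative sketch only: a tiny Markov-style prefetcher front-end that,
# on each cache miss, looks up previously observed successor addresses and
# queues them for prefetch in priority order. Table sizes, the number of
# predictions per entry, and the priority rule are arbitrary assumptions.
from collections import OrderedDict, deque

class MarkovPrefetchTable:
    def __init__(self, max_entries=1024, preds_per_entry=4):
        self.table = OrderedDict()          # miss address -> list of (successor, count)
        self.max_entries = max_entries
        self.preds_per_entry = preds_per_entry
        self.last_miss = None

    def record_miss(self, addr):
        """Update the table with the observed miss and return prefetch candidates."""
        if self.last_miss is not None:
            preds = self.table.setdefault(self.last_miss, [])
            for i, (succ, count) in enumerate(preds):
                if succ == addr:
                    preds[i] = (succ, count + 1)
                    break
            else:
                preds.append((addr, 1))
            # Keep only the most frequently seen successors for this address.
            preds.sort(key=lambda p: -p[1])
            del preds[self.preds_per_entry:]
            if len(self.table) > self.max_entries:
                self.table.popitem(last=False)   # evict the oldest entry
        self.last_miss = addr
        # Higher-count successors are delivered to the prefetch queue first.
        return [succ for succ, _ in self.table.get(addr, [])]

# Usage: feed the miss stream and collect prioritized prefetch requests.
prefetcher = MarkovPrefetchTable()
prefetch_queue = deque()
for miss in [0x100, 0x2A0, 0x100, 0x2A0, 0x100]:
    prefetch_queue.extend(prefetcher.record_miss(miss))
print([hex(a) for a in prefetch_queue])
# the trained table now suggests 0x2A0 after a miss on 0x100, and vice versa
```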

Patent
Doug Joseph, Thomas D. Lovett
06 Jun 2001
TL;DR: Secure inter-node communication is disclosed in this paper: the hardware of a first node sends a key and the identifications of the first and second nodes to the hardware of the second node, which verifies the identifications and stores the key.
Abstract: Secure inter-node communication is disclosed. The hardware of the first node sends a key, identification of the first node, and identification of a second node to hardware of the second node. The hardware of the second node receives the key and the identifications. The hardware of the second node verifies the identifications of the first and the second nodes, and stores the key. The key stored in the hardware of the first and the second nodes allows for a secure transmission channel from the software of the first node to software of the second node.

43 citations
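
The patent abstract describes a flow rather than an algorithm; the toy sketch below only mirrors that flow (first-node hardware sends a key plus both node identifications, second-node hardware verifies the identifications and stores the key, and node software then uses the stored key). The class names, the identification check, and the HMAC-based message authentication are hypothetical stand-ins, not the patented mechanism.

```python
# Hypothetical illustration of the flow described in the patent abstract:
# node hardware exchanges a key and node identifications, verifies them,
# and later lets node software use the stored key to protect messages.
# The HMAC-based "secure channel" here is only a stand-in for illustration.
import hmac, hashlib, secrets

class NodeHardware:
    def __init__(self, node_id, known_nodes):
        self.node_id = node_id
        self.known_nodes = set(known_nodes)   # identifications this node will accept
        self.keys = {}                        # peer node id -> shared key

    def initiate(self, peer):
        """First node's hardware: generate a key and push it to the peer's hardware."""
        key = secrets.token_bytes(32)
        peer.receive(key, sender_id=self.node_id, receiver_id=peer.node_id)
        self.keys[peer.node_id] = key
        return key

    def receive(self, key, sender_id, receiver_id):
        """Second node's hardware: verify both identifications, then store the key."""
        if receiver_id != self.node_id or sender_id not in self.known_nodes:
            raise ValueError("node identification check failed")
        self.keys[sender_id] = key

def software_send(hw, peer_id, message):
    """Node software: authenticate a message with the hardware-held shared key."""
    tag = hmac.new(hw.keys[peer_id], message, hashlib.sha256).digest()
    return message, tag

# Usage: node A provisions a key into node B's hardware, then software on A
# sends an authenticated message that software on B can verify.
a = NodeHardware("A", known_nodes=["B"])
b = NodeHardware("B", known_nodes=["A"])
a.initiate(b)
msg, tag = software_send(a, "B", b"hello")
assert hmac.compare_digest(tag, hmac.new(b.keys["A"], msg, hashlib.sha256).digest())
```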

Patent
09 Apr 2010
TL;DR: In this paper, a system and method for indexing documents in a data storage system includes generating a document hash table in storage memory for a single document using an index construction in a multithreaded and scalable configuration wherein multiple threads are each assigned work to reduce synchronization between threads.
Abstract: A system and method for indexing documents in a data storage system includes generating a single document hash table in storage memory for a single document using an index construction in a multithreaded and scalable configuration wherein multiple threads are each assigned work to reduce synchronization between threads. Generating the single document hash table includes partitioning the single document and indexing strings of the partitioned portions to create a minor hash table for each document sub-part; generating a document-level hash table from the minor hash tables; updating a stream-level hash table for the strings, which maps every string to a global identifier; and generating a term-reordered array from the document-level hash table.

16 citations
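
As a rough, single-threaded illustration of the pipeline in the abstract, the sketch below builds a minor hash table per document sub-part, merges them into a document-level table, updates a stream-level table that maps strings to global identifiers, and emits a term array expressed in those identifiers. The data layouts, the whitespace tokenization, and the reading of "term reordered array" are guesses for illustration; the multithreaded work assignment in the patent is omitted.

```python
# Illustrative, single-threaded sketch of the indexing flow in the abstract:
# minor hash tables per document sub-part -> document-level hash table ->
# stream-level string-to-global-id table -> term array rewritten to ids.
# All structure names and the simple whitespace tokenization are assumptions.
from collections import Counter

stream_table = {}   # string -> global identifier, shared across documents

def index_document(text, parts=4):
    tokens = text.split()
    # Partition the document and build a "minor" hash table per sub-part.
    step = max(1, len(tokens) // parts)
    minors = [Counter(tokens[i:i + step]) for i in range(0, len(tokens), step)]
    # Merge the minor tables into a document-level hash table.
    doc_table = Counter()
    for minor in minors:
        doc_table.update(minor)
    # Update the stream-level table so every string has a global identifier.
    for term in doc_table:
        stream_table.setdefault(term, len(stream_table))
    # Produce a term array for the document expressed in global identifiers.
    reordered = sorted(stream_table[t] for t in doc_table)
    return doc_table, reordered

doc_table, reordered = index_document("to be or not to be")
print(doc_table)        # per-document term counts
print(reordered)        # document terms as sorted global identifiers
```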

TL;DR: The Markov prefetcher as discussed by the authors is an extension of the simple A → B correlation to a full Markov model of prior memory reference behavior, which has been shown to be effective for unstructured workloads.
Abstract: Processor performance improvements arise from a combination of process or technology changes, programming model changes to more clearly express parallelism, and architectural changes to exploit that parallelism (ILP). All of these factors come with differing costs, and computer architects seek to balance design changes to reduce costs. In the 1990s, it was clear that the disparity between CPU and memory speed was an impending issue, and microarchitectural techniques such as multi-level caches, varying cache block sizes and cache organizations were important techniques to explore [8]. Later, the "Memory Wall" [19] made clear the disparity between CPU performance and memory speed, both at the time and in the future. Although fundamentally switching the computing model, as called out in the "Memory Wall" paper, was possible, there were also opportunities to improve existing architectures through microarchitectural changes. At the time, Doug Joseph was a Ph.D. student at the University of Colorado while also working as a member of the technical staff at IBM on high-performance systems. Dirk Grunwald had been working on improving instruction [11] and data [3] caches. Doug Joseph had an interest in AI and machine learning, which is also reflected in his current position on architectural acceleration for deep learning. At the same time, Dirk Grunwald had been working with others on the application of machine learning to branch prediction using decision trees [4]. They decided Doug should pursue a thesis based on a preliminary idea that was the genesis of Markov prefetching. At the time, there had been extensive work on memory prefetchers that used arithmetic relations between memory addresses and were effective on structured workloads [6], [10], [13], [15]. Research on prefetching for unstructured workloads was less common [5], [12], [14], [20]. The Markov prefetcher that was the core of Doug Joseph's Ph.D. thesis is a continuing evolution of what has been called correlation-based prefetching [1], [5], which was similar to a method patented by Pomerene et al. [16]. In correlation prefetching, a memory reference A followed by a miss B would create an entry in a "shadow directory" that recorded that relationship. In [16], the shadow directory was off-chip and the reference stream focused on the miss references from the last-level cache (or references external to the processor). The key idea of Markov prefetching was to extend the simple A → B correlation to a full Markov model of prior memory reference behavior.
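
To ground the last two sentences, here is a small illustrative contrast between a correlation ("shadow directory") entry, which remembers only the most recent successor of a miss address, and a Markov-style table, which keeps counts for every observed successor and can be read as estimated transition probabilities. This is a sketch of the general idea, not the mechanism in the thesis or in the Pomerene et al. patent.

```python
# Sketch of the conceptual step described above, with hypothetical structures:
# a correlation ("shadow directory") entry remembers one successor per miss
# address, while a Markov model keeps counts for all observed successors and
# treats them as estimated transition probabilities.
from collections import defaultdict

miss_stream = [0x10, 0x20, 0x10, 0x30, 0x10, 0x20]

# Simple A -> B correlation: only the most recent successor is remembered.
correlation = {}
# Markov model: counts of every successor ever seen after each address.
transitions = defaultdict(lambda: defaultdict(int))

for prev, curr in zip(miss_stream, miss_stream[1:]):
    correlation[prev] = curr
    transitions[prev][curr] += 1

print(hex(correlation[0x10]))             # 0x20: the single remembered successor
total = sum(transitions[0x10].values())
print({hex(s): c / total for s, c in transitions[0x10].items()})
# roughly {'0x20': 0.67, '0x30': 0.33}: a distribution over likely next misses
```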

Cited by
Book
10 Sep 2007
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Abstract: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be? Memory Systems: Cache, DRAM, Disk shows you how to resolve this problem. The book tells you everything you need to know about the logical design and operation, physical design and operation, performance characteristics and resulting design trade-offs, and the energy consumption of modern memory hierarchies. You learn how to tackle the challenging optimization problems that result from the side-effects that can appear at any point in the entire hierarchy. As a result you will be able to design and emulate the entire memory hierarchy.
- Understand all levels of the system hierarchy: cache, DRAM, and disk.
- Evaluate the system-level effects of all design choices.
- Model performance and energy consumption for each component in the memory hierarchy.

659 citations

Proceedings ArticleDOI
01 May 2003
TL;DR: This paper presents a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales, and can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
Abstract: In a single second a modern processor can execute billions of instructions. Obtaining a bird's eye view of the behavior of a program at these speeds can be a difficult task when all that is available is cycle by cycle examination. In many programs, behavior is anything but steady state, and understanding the patterns of behavior, at run-time, can unlock a multitude of optimization opportunities. In this paper, we present a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales. By examining the proportion of instructions that were executed from different sections of code, we can find generic phases that correspond to changes in behavior across many metrics. By classifying phases generically, we avoid the need to identify phases for each optimization, and enable a unified prediction scheme that can forecast future behavior. Our analysis shows that our design can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.

512 citations
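
A compact software sketch of the core idea in the abstract: summarize each execution interval by the proportion of instructions that fall in different code regions, then give intervals with similar proportion vectors the same phase identifier. The region granularity, interval contents, and similarity threshold below are invented parameters, not the paper's hardware design.

```python
# Illustrative software sketch of phase classification by code-region mix:
# each interval becomes a vector of "fraction of instructions executed in
# each code region", and intervals with similar vectors share a phase id.
# Region granularity and the similarity threshold are arbitrary choices.
def region_vector(pcs, region_size=256, num_regions=64):
    """Proportion of an interval's instructions falling in each code region."""
    vec = [0.0] * num_regions
    for pc in pcs:
        vec[(pc // region_size) % num_regions] += 1
    n = len(pcs)
    return [v / n for v in vec]

def classify(intervals, threshold=0.5):
    """Assign each interval the id of the first prior interval it resembles."""
    signatures, phase_ids = [], []
    for pcs in intervals:
        vec = region_vector(pcs)
        for pid, sig in enumerate(signatures):
            if sum(abs(a - b) for a, b in zip(vec, sig)) < threshold:
                phase_ids.append(pid)
                break
        else:
            signatures.append(vec)
            phase_ids.append(len(signatures) - 1)
    return phase_ids

# Usage: two intervals in one loop, one interval in a different code region.
loop_a = [0x1000 + (i % 64) * 4 for i in range(1000)]
loop_b = [0x8000 + (i % 64) * 4 for i in range(1000)]
print(classify([loop_a, loop_a, loop_b]))   # [0, 0, 1]: two phases detected
```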

Proceedings ArticleDOI
08 Feb 2003
TL;DR: This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window.
Abstract: Today's high performance processors tolerate long latency operations by means of out-of-order execution. However, as latencies increase, the size of the instruction window must increase even faster if we are to continue to tolerate these latencies. We have already reached the point where the size of an instruction window that can handle these latencies is prohibitively large in terms of both design complexity and power consumption. And, the problem is getting worse. This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window. Runahead execution unblocks the instruction window blocked by long latency operations allowing the processor to execute far ahead in the program path. This results in data being prefetched into caches long before it is needed. On a machine model based on the Intel® Pentium® processor, having a 128-entry instruction window, adding runahead execution improves the IPC (instructions per cycle) by 22% across a wide range of memory intensive applications. Also, for the same machine model, runahead execution combined with a 128-entry window performs within 1% of a machine with no runahead execution and a 384-entry instruction window.

481 citations
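
Runahead execution is an out-of-order pipeline mechanism, so short code can only caricature it; the toy trace-driven model below merely illustrates the claimed effect, namely that the latency of an outstanding miss can be spent running ahead to turn future misses into prefetches. The latencies, lookahead depth, and cache model are invented for the example.

```python
# Toy, trace-driven caricature of runahead execution (not the paper's model):
# on a cache miss the processor normally just waits; with runahead it uses
# the miss latency to scan ahead in the load stream and prefetch, so later
# accesses hit. All latencies and the lookahead depth are invented numbers.
def run(loads, runahead, miss_latency=100, hit_latency=1, lookahead=32):
    cache, cycles = set(), 0
    for i, addr in enumerate(loads):
        if addr in cache:
            cycles += hit_latency
            continue
        cycles += miss_latency            # stall for the demand miss
        cache.add(addr)
        if runahead:
            # Pretend the runahead episode discovers and prefetches the next
            # few load addresses while the demand miss is being serviced.
            cache.update(loads[i + 1:i + 1 + lookahead])
    return cycles

trace = list(range(0, 64 * 64, 64))       # 64 loads, each to a new cache line
print(run(trace, runahead=False))         # 6400 cycles: every load misses
print(run(trace, runahead=True))          # 262 cycles: only two loads still miss
```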

Proceedings ArticleDOI
01 Dec 1997
TL;DR: Comparison of context-based prediction and stride prediction shows that the higher accuracy of context-based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors.
Abstract: The predictability of data values is studied at a fundamental level. Two basic predictor models are defined. Computational predictors perform an operation on previous values to yield predicted next values; examples we study are stride value prediction (which adds a delta to a previous value) and last value prediction (which performs the trivial identity operation on the previous value). Context-based predictors match recent value history (context) with previous value history and predict values based entirely on previously observed patterns. To understand the potential of value prediction we perform simulations with unbounded prediction tables that are immediately updated using correct data values. Simulations of integer SPEC95 benchmarks show that data values can be highly predictable. Best performance is obtained with context-based predictors; overall prediction accuracies are between 56% and 91%. The context-based predictor typically has an accuracy about 20% better than the computational predictors (last value and stride). Comparison of context-based prediction and stride prediction shows that the higher accuracy of context-based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors. Among different instruction types, predictability varies significantly. In general, load and shift instructions are more difficult to predict correctly, whereas add instructions are more predictable.

455 citations
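
The computational predictors in the abstract are simple enough to show directly. The sketch below implements last value and stride prediction, plus a toy context-based predictor keyed on the last few values; a single value stream stands in for one static instruction's outcomes, and table sizing is ignored, unlike the paper's unbounded-table methodology.

```python
# Minimal sketches of the predictor models discussed above. Real studies use
# per-static-instruction tables; here a single value stream stands in for one
# instruction's outcomes, and table sizing/replacement is ignored.
class LastValuePredictor:
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last                      # identity on the previous value
    def update(self, value):
        self.last = value

class StridePredictor:
    def __init__(self):
        self.last, self.stride = None, 0
    def predict(self):
        return None if self.last is None else self.last + self.stride
    def update(self, value):
        if self.last is not None:
            self.stride = value - self.last   # delta between successive values
        self.last = value

class ContextPredictor:
    """Predict the value that previously followed the current value history."""
    def __init__(self, order=2):
        self.order, self.history, self.table = order, (), {}
    def predict(self):
        return self.table.get(self.history)
    def update(self, value):
        if len(self.history) == self.order:
            self.table[self.history] = value
        self.history = (self.history + (value,))[-self.order:]

def accuracy(predictor, values):
    correct = 0
    for v in values:
        correct += predictor.predict() == v
        predictor.update(v)
    return correct / len(values)

pattern = [1, 2, 3] * 50                              # repeating, non-stride-friendly
print(accuracy(LastValuePredictor(), pattern))        # 0.0: values never repeat back-to-back
print(accuracy(StridePredictor(), pattern))           # about 0.33: the stride keeps changing
print(accuracy(ContextPredictor(order=2), pattern))   # about 0.97: the pattern is learned
```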

Proceedings ArticleDOI
04 Dec 2004
TL;DR: This paper develops L2 cache designs for CMPs that incorporate block migration, on-chip transmission lines, and stride-based prefetching between the L1 and L2 caches, and presents a hybrid design combining all three techniques that improves performance by an additional 2% to 19% over prefetching alone.
Abstract: In response to increasing (relative) wire delay, architects have proposed various technologies to manage the impact of slow wires on large uniprocessor L2 caches. Block migration (e.g., D-NUCA and NuRapid) reduces average hit latency by migrating frequently used blocks towards the lower-latency banks. Transmission Line Caches (TLC) use on-chip transmission lines to provide low latency to all banks. Traditional stride-based hardware prefetching strives to tolerate, rather than reduce, latency. Chip multiprocessors (CMPs) present additional challenges. First, CMPs often share the on-chip L2 cache, requiring multiple ports to provide sufficient bandwidth. Second, multiple threads mean multiple working sets, which compete for limited on-chip storage. Third, sharing code and data interferes with block migration, since one processor's low-latency bank is another processor's high-latency bank. In this paper, we develop L2 cache designs for CMPs that incorporate these three latency management techniques. We use detailed full-system simulation to analyze the performance trade-offs for both commercial and scientific workloads. First, we demonstrate that block migration is less effective for CMPs because 40-60% of L2 cache hits in commercial workloads are satisfied in the central banks, which are equally far from all processors. Second, we observe that although transmission lines provide low latency, contention for their restricted bandwidth limits their performance. Third, we show stride-based prefetching between L1 and L2 caches alone improves performance by at least as much as the other two techniques. Finally, we present a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.

391 citations
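
Of the three latency-management techniques compared, stride-based prefetching is the easiest to sketch in a few lines. The illustrative per-PC table below tracks the last address and stride for each load and requests the next lines into an assumed L2 once the stride repeats; the confidence rule, prefetch degree, and table management are invented, not the paper's configuration.

```python
# Illustrative per-PC stride prefetcher (a sketch of the general technique the
# paper evaluates between L1 and L2, not its exact configuration): when a load
# repeats the same stride twice, the next lines are requested into the L2.
class StridePrefetcher:
    def __init__(self, degree=1):
        self.entries = {}     # load PC -> (last address, last stride, confidence)
        self.degree = degree  # how many strides ahead to prefetch

    def access(self, pc, addr):
        """Observe one L1 access; return addresses to prefetch into the L2."""
        last_addr, last_stride, conf = self.entries.get(pc, (None, 0, 0))
        prefetches = []
        if last_addr is not None:
            stride = addr - last_addr
            conf = conf + 1 if stride == last_stride and stride != 0 else 0
            if conf >= 1:     # stride seen at least twice in a row
                prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
            last_stride = stride
        self.entries[pc] = (addr, last_stride, conf)
        return prefetches

pf = StridePrefetcher(degree=2)
for a in [0x100, 0x140, 0x180, 0x1C0]:          # one load PC striding by 0x40
    print([hex(p) for p in pf.access(pc=0x400, addr=a)])
# [], [], ['0x1c0', '0x200'], ['0x200', '0x240']
```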