Proceedings ArticleDOI

A multithreaded processor designed for distributed shared memory systems

19 Mar 1997, pp. 206-213
TL;DR: This work presents two techniques to reduce the context switch cost to at most one processor cycle: A context switch is explicitly coded in the opcode, and a context switch buffer is used.
Abstract: The multithreaded processor, called Rhamma, uses a fast context switch to bridge latencies caused by memory accesses or by synchronization operations. Load/store, synchronization, and execution operations of different threads of control are executed simultaneously by appropriate functional units. A fast context switch is performed whenever a functional unit comes across an operation that is destined for another unit. The overall performance depends on the speed of the context switch. We present two techniques to reduce the context switch cost to at most one processor cycle: a context switch is explicitly coded in the opcode, and a context switch buffer is used. The load/store unit shows up as the principal bottleneck. We evaluate four implementation alternatives of the load/store unit to increase processor performance.

Summary (2 min read)

1. Introduction

  • The access of remote data and the synchronization of threads cause processor idle times.
  • The authors further implement suitable synchronization primitives that prevent busy waiting.
  • The processor should be able to bridge memory latencies and synchronization waiting times so efficiently that it could also be applied in a single-processor workstation.

2. The Processor Architecture

  • Each unit stops the execution of a thread when its decode stage recognizes an instruction intended for another unit.
  • To perform a context switch the unit passes the thread tag to the FIFO buffer of the unit that is appropriate for the execution of the instruction.
  • Then the unit resumes processing with another thread of its own FIFO buffer.
  • The units execute different threads of control.
  • Therefore, they access different activation frames and thus different register sets.

4. Alternative Implementations of the Load/Store Unit

  • After sending a load/store request to the memory, the next data-independent instructions are scheduled (the alternative called combining interleaving and overlapping).
  • In the case that the load/store unit has to stall for a dependent instruction, the unit switches the thread of control.

5. The Simulator

  • As simulation workload the authors applied several small application programs written in Modula-2.
  • The applications were compiled to the machine language of DLX and to the extended machine language of Rhamma.
  • For the simulations presented in this paper the authors chose a set of synthetic benchmark programs.
  • The workload is characterized by 100 000 instructions, three threads, and a rate of one load/store instruction to three execution instructions.
  • The number of data independent succeeding instructions is two.

6. Simulation Results

  • In the range of realistic cache hit rates from 10% to 60%, the three multithreading techniques perform as well as conventional processors with a very good cache hit rate of 80% or higher.
  • The configurations a.-c. and d1.-d3. are shown in single diagrams and compared with the performance of a conventional processor, either with a stalling or with an overlapping load/store unit implementation.
  • Removing the remote memory, the diagram illustrates the memory hierarchy behavior of a system with a single processor.
  • But the conventional processor with overlapping load/store and execution instructions performs better than the simple multithreading approach for systems without cache or with low cache hit rates.

7. Conclusions

  • The authors presented a multithreaded processor which uses fast context switching to bridge latencies caused by memory accesses or synchronization operations.
  • While the access time can be fully bridged by multithreading, the cycle time proves to be the critical parameter.
  • The more sophisticated load/store unit implementations increase the performance of the multithreaded processor.
  • As the next step after the software simulation the authors developed a VHDL implementation of Rhamma.
  • The authors are working towards a hardware prototype.


A Multithreaded Processor Designed for Distributed Shared Memory Systems
Winfried Grünewald and Theo Ungerer
Department of Computer Design and Fault Tolerance, University of Karlsruhe, 76128 Karlsruhe, Germany
Phone +721-608-6048, Fax +721-370455, Email: {gruenewald, ungerer}@informatik.uni-karlsruhe.de
Abstract
The multithreaded processor, called Rhamma, uses a fast context switch to bridge latencies caused by memory accesses or by synchronization operations. Load/store, synchronization, and execution operations of different threads of control are executed simultaneously by appropriate functional units. A fast context switch is performed whenever a functional unit comes across an operation that is destined for another unit. The overall performance depends on the speed of the context switch. We present two techniques to reduce the context switch cost to at most one processor cycle: a context switch is explicitly coded in the opcode, and a context switch buffer is used. The load/store unit shows up as the principal bottleneck. We evaluate four implementation alternatives of the load/store unit to increase processor performance.
1. Introduction
Currently, standard or application-specific microprocessors are used as nodes for multiprocessor systems. Standard microprocessors are developed and optimized for microcomputers or workstations with a single processor or with a low number of processors tied to a common bus. The use of standard microprocessors limits the scalability of shared memory multiprocessor systems unless provisions are made to bridge latencies caused by remote memory accesses or by synchronization operations. Because of the small market segment of multiprocessor systems, designing microprocessors specifically for use in multiprocessors is expensive.
Our research project aims at the development of a processor which is suitable for a node in a distributed shared memory (DSM) system as well as in a uniprocessor system. The storage of a DSM system is physically distributed, but all processors share a common address space. As a consequence, memory access time depends on the location of the accessed data. The data can be in the processor cache, the local memory, or the remote memory.
The access of remote data and the synchronization of threads cause processor idle times. It is the object of our research to fill these idle times by switching extremely fast to another thread of control. We further implement suitable synchronization primitives that prevent busy waiting. The processor should be able to bridge memory latencies and synchronization waiting times so efficiently that it could also be applied in a single-processor workstation.
Related approaches are:
  • the finely grained multithreaded processors of the HEP [1], Horizon [2], and Tera [3] systems, which switch the context on every instruction,
  • the block multithreaded processors Sparcle of the MIT Alewife machine [4], MSparc [5], and MTA [6],
  • the multithreaded superscalar processors developed at the Media Research Laboratory of Matsushita Electric Industrial Co. [7], at the University of California, Irvine [8], and at the University of Karlsruhe [9], and the simultaneous multithreaded processor of the University of Washington [10], and
  • the decoupled access/execute architecture DAE [11], which splits instruction processing of a single thread of control into memory access and execution tasks, executed by different units that communicate via architectural queues.
Our approach is most similar to the Sparcle and MSparc processors, which switch the context on a cache miss. However, the execution unit of our processor switches the context whenever it comes across a load, store, or synchronization instruction, and the load/store unit switches whenever it meets an execution or synchronization instruction. In contrast to Sparcle, the context switch is triggered by the decode unit in an early stage of the pipeline, thus decreasing context switching time. On the other hand, the overall performance of our processor may suffer from the higher rate of context switches unless the context switch time is very small. Implementation alternatives for a very fast context switch are presented.

Figure 1: Microarchitecture of the Rhamma processor. The sync unit, load/store unit, and execution unit are coupled by FIFO buffers that carry thread tags and sync requests; the units share the register sets, and the load/store unit exchanges requests and acknowledgements with the memory interface.
2. The Processor Architecture
The main idea is to remove all operations that may cause active waiting from the execution unit. Therefore, load, store, and synchronization operations are performed by different units within the processor. We distinguish idle times caused by memory accesses from idle times caused by synchronization operations. The former depend on the memory hierarchy of a DSM system, and these idle times are predictable within a time period that varies with network access conflicts. The latter depend on the program execution and are not predictable. We assign one unit to the load and store operations, the load/store unit, and another unit to the synchronization operations, the sync unit. The execution unit processes the arithmetic-logic and the control instructions. Each of these units executes instructions from another thread. The units are coupled by FIFO buffers and access different register sets. The microarchitecture of the multithreaded processor is shown in figure 1.
A unique thread tag identifies each thread. An activation frame is assigned to each thread, holding thread-local data, e.g. the program counter, the thread tag, and other state information. The activation frames are physically distributed over the register sets. If more activation frames exist than register sets are available, the activation frames of blocked threads are stored in memory.
Each unit stops the execution of a thread when its decode stage recognizes an instruction intended for another unit. To perform a context switch, the unit passes the thread tag to the FIFO buffer of the unit that is appropriate for the execution of the instruction. Then the unit resumes processing with another thread from its own FIFO buffer. The units execute different threads of control. Therefore, they access different activation frames and thus different register sets. A fast context switch is realized by simply switching to another register set. A more detailed microarchitecture description of the multithreaded processor is given in [12]. In the following, we omit the sync unit and concentrate on the load/store and execution units.
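As an illustration of this hand-over, the following minimal C sketch models thread tags circulating between per-unit FIFO buffers. It is only a sketch of the mechanism described above, not the authors' hardware or simulator code; the unit names, the FIFO depth, and the helper functions are assumptions made for the example.

/* Minimal sketch (illustrative, not the authors' implementation) of how
 * Rhamma's units could hand thread tags to each other via FIFO buffers. */
#include <stdio.h>

#define FIFO_DEPTH 8

typedef enum { UNIT_EXEC, UNIT_LOADSTORE, UNIT_SYNC } unit_t;

typedef struct {
    int tags[FIFO_DEPTH];   /* thread tags waiting for this unit */
    int head, count;
} fifo_t;

static fifo_t fifo[3];      /* one FIFO per functional unit */

static void fifo_push(fifo_t *f, int tag) {
    if (f->count < FIFO_DEPTH)
        f->tags[(f->head + f->count++) % FIFO_DEPTH] = tag;
}

static int fifo_pop(fifo_t *f) {             /* -1 means: no runnable thread */
    if (f->count == 0) return -1;
    int tag = f->tags[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return tag;
}

/* Called when a unit's decode stage sees an instruction that belongs to
 * another unit: the current thread tag is forwarded to that unit's FIFO,
 * and the unit resumes with the next tag from its own FIFO, i.e. it simply
 * switches to another register set. */
static int context_switch(unit_t self, unit_t target, int current_tag) {
    fifo_push(&fifo[target], current_tag);
    return fifo_pop(&fifo[self]);
}

int main(void) {
    fifo_push(&fifo[UNIT_EXEC], 1);          /* two threads ready for exec */
    fifo_push(&fifo[UNIT_EXEC], 2);
    int running = fifo_pop(&fifo[UNIT_EXEC]);
    /* thread 1 hits a load: hand it to the load/store unit, run thread 2 */
    running = context_switch(UNIT_EXEC, UNIT_LOADSTORE, running);
    printf("execution unit now runs thread %d\n", running);
    return 0;
}

In the sketch, a thread reaching a load in the execution unit is queued for the load/store unit, while the execution unit immediately continues with the next ready thread from its own FIFO.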
3. Fast Context Switch
In general, using a five-stage processor pipeline (e.g. instruction fetch, decode, operand fetch, execution, write back), a context switch is recognized in the decode stage. This unnecessary decoding costs one cycle. We allow one additional cycle each for accessing the new thread tag and for loading the new instruction pointer from the thread tag. The first instruction of the new thread is decoded after two further cycles, so the context switching overhead sums up to five cycles.
Besides the software simulations presented in this paper, we implemented the Rhamma architecture in VHDL and optimized the hardware towards a fast context switch. We obtained a context switch cost of at most one processor cycle by applying two optimizations:
  • The first technique is to code the context switch explicitly in the first opcode bit of the instruction. A complete decoding is not necessary to recognize a context switch. The instruction fetch stage already recognizes the context switch itself, and the context switch just costs the cycle needed to fetch the instruction.
  • The second technique applies a context switch buffer, which is similar to the branch buffers in modern microprocessors. The context switch buffer is a small table in the execution unit which holds the addresses of the most recently used load/store instructions. If the address of the next instruction to be fetched matches an address in the context switch buffer, a context switch is performed immediately. In this case the context switching time is reduced to zero; otherwise the first method is used. Our simulations with real workloads have shown that only a small buffer with about 32 entries is required. The context switch buffer is also suitable if the instruction fetch costs more than one processor cycle, as is usual in modern processors.
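The C sketch below illustrates how such a context switch buffer could behave, analogous to a branch target buffer: a hit on the fetch address triggers the switch at zero cost, a miss falls back to the opcode-bit method and records the address for next time. The direct-mapped organization, the indexing, and all names are assumptions for this example; the paper only states that about 32 entries sufficed in simulation.

/* Sketch of a context switch buffer, analogous to a branch target buffer:
 * a small table of instruction addresses recently found to trigger a
 * context switch. Organization and size are illustrative assumptions. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CSB_ENTRIES 32

typedef struct {
    uint32_t addr[CSB_ENTRIES];
    bool     valid[CSB_ENTRIES];
} csb_t;

static unsigned csb_index(uint32_t pc) {
    return (pc >> 2) % CSB_ENTRIES;          /* word-aligned instructions */
}

/* Zero-cycle case: the fetch address of the next instruction already
 * matches a buffered load/store address, so the switch is performed
 * before the instruction is even fetched. */
static bool csb_lookup(const csb_t *csb, uint32_t pc) {
    unsigned i = csb_index(pc);
    return csb->valid[i] && csb->addr[i] == pc;
}

/* One-cycle case: the fetch stage sees the context-switch bit in the
 * opcode, switches, and records the address for the next encounter. */
static void csb_insert(csb_t *csb, uint32_t pc) {
    unsigned i = csb_index(pc);
    csb->addr[i] = pc;
    csb->valid[i] = true;
}

int main(void) {
    csb_t csb = {0};
    uint32_t pc = 0x1000;                    /* address of a load instruction */
    printf("first encounter, hit=%d (costs one fetch cycle)\n", csb_lookup(&csb, pc));
    csb_insert(&csb, pc);
    printf("next encounter,  hit=%d (zero-cycle switch)\n", csb_lookup(&csb, pc));
    return 0;
}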

4. Alternative Implementations of the Load/Store Unit
The main bottleneck of each high-performance processor is the unit executing load and store instructions. In a multithreaded processor the load/store bottleneck is even more critical than in a conventional processor because of the higher data throughput. Multithreading, however, opens new possibilities for solving the load/store bottleneck. We studied four implementation alternatives:
  • Stalling: The simplest implementation is to issue a load or store request to the memory interface and then wait for the load/store acknowledgement that signals completion of the memory access before the next instruction is scheduled.
  • Interleaving: The load/store unit switches the thread of control after each load or store request. A load or store instruction of another thread can then be scheduled. The succeeding instructions of the switched thread are executed after receiving the acknowledgement corresponding to the memory request.
  • Overlapping: One or several load/store requests are sent to the memory. Then the thread tag is handed over to the execution unit or synchronization unit, respectively. The next execution instructions are executed if they are data independent from the previous ones.
  • Combining interleaving and overlapping: After sending a load/store request to the memory, the next data-independent instructions are scheduled. If the load/store unit has to stall for a dependent instruction, the unit switches the thread of control (see the sketch after this list).
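As an illustration of the combined alternative, the C sketch below encodes the scheduling decision after a load/store request has been issued: data-independent instructions continue in the same thread (overlapping), while a dependent instruction with an outstanding request causes a thread switch (interleaving). The function and its inputs are simplifications assumed for this example, not the simulator's interface.

/* Sketch of the scheduling decision in the combined load/store unit
 * (interleaving + overlapping). The dependence test and the notion of an
 * "outstanding request" are simplified assumptions for illustration. */
#include <stdbool.h>
#include <stdio.h>

typedef enum { CONTINUE_SAME_THREAD, SWITCH_THREAD } decision_t;

/* After a load/store request has been issued for the current thread:
 *  - overlapping: data-independent instructions may proceed while the
 *    memory access is still pending;
 *  - interleaving: when a dependent instruction is reached before the
 *    acknowledgement arrives, the unit switches to another thread
 *    instead of stalling. */
static decision_t schedule_after_request(bool next_insn_depends_on_load,
                                         bool ack_received)
{
    if (!next_insn_depends_on_load)
        return CONTINUE_SAME_THREAD;          /* overlap with the access   */
    if (ack_received)
        return CONTINUE_SAME_THREAD;          /* data already available    */
    return SWITCH_THREAD;                     /* interleave another thread */
}

int main(void) {
    printf("%d\n", schedule_after_request(false, false)); /* 0: keep going */
    printf("%d\n", schedule_after_request(true,  false)); /* 1: switch     */
    printf("%d\n", schedule_after_request(true,  true));  /* 0: keep going */
    return 0;
}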
5. The Simulator
We use an event-driven simulation of the Rhamma processor at the register transfer level, which can model the behavior of a single-processor system or of a memory-coupled multiprocessor system. The execution unit is a processor based on the DLX processor of Stanford University [13]. The DLX processor is a conventional RISC processor with a five-stage pipeline. The DLX instruction set is extended by synchronization and thread management instructions. In our simulations we evaluate our multithreaded Rhamma processor against a conventional processor without multithreading, represented by the original DLX processor, and against a multithreaded processor with context switching on cache miss, similar to the Sparcle/MSparc processors. The latter is also based on the DLX processor and uses a one-cycle context switch, in contrast to the original Sparcle processor, which needs 14 cycles for a context switch. A one-cycle context switch will be difficult to implement.
We assume one simulation time step per pipeline stage for each instruction execution and for the access to the instruction memory. The access to a FIFO queue and the minimum time the data has to stay in a FIFO queue are also one simulation time step each.
We vary:
  • the thread switching cost: the number of time steps necessary to switch the execution unit or the load/store unit to another thread of control,
  • the access time(s): the number of time steps from a memory request to its completion,
  • the cycle time(s): the minimum number of time steps between two memory accesses, and
  • the hit rate(s): the percentage of memory requests served by the cache, the local memory, or the remote memory.
Depending on the memory hierarchy, we distinguish the access times, cycle times, and hit rates of the cache, the local memory, and the remote memory within a DSM system. We assume split transactions on the network of the DSM system. Therefore, the remote memory cycle time is chosen as a fraction of the remote memory access time. Access and cycle times are shown in table 1. We vary the cache hit rate and the local memory access rate.
As explained in section 3, the context switch of our VHDL implementation of Rhamma costs at most a single cycle and is reduced to zero if a context switch buffer entry matches. For the software simulations we used a context switch cost of one simulation time step.
As simulation workload we applied several small application programs written in Modula-2. The applications were compiled to the machine language of DLX and to the extended machine language of Rhamma. For the simulations presented in this paper we chose a set of synthetic benchmark programs. The workload is characterized by 100,000 instructions, three threads, and a ratio of one load/store instruction to three execution instructions. The number of data-independent succeeding instructions is two. This simulation workload does not contain synchronization instructions.
Table 1: Access and cycle times (in simulation time steps)

                   access time   cycle time
  cache                  1            1
  local memory          25           25
  remote memory        100           25
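For orientation, the short C sketch below computes the average memory access time implied by table 1 for the realistic access mix used later in the paper (60% cache, 10% local memory, 30% remote memory accesses). The simple weighted-average model is an assumption for this back-of-the-envelope check; the simulator itself is event driven.

/* Back-of-the-envelope check of the simulated memory hierarchy: average
 * access time implied by table 1 for a 60/10/30 cache/local/remote mix.
 * The weighting model is an assumption; it ignores queueing and overlap. */
#include <stdio.h>

int main(void) {
    const double access[3] = { 1.0, 25.0, 100.0 };  /* cache, local, remote */
    const double rate[3]   = { 0.60, 0.10, 0.30 };  /* hit-rate mix         */

    double avg = 0.0;
    for (int i = 0; i < 3; i++)
        avg += rate[i] * access[i];

    /* 0.6*1 + 0.1*25 + 0.3*100 = 33.1 time steps per memory access */
    printf("average access time: %.1f time steps\n", avg);
    return 0;
}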

Figure 2: Simulation diagrams for the four load/store unit implementation alternatives (stalling, interleaving, overlapping, and combined overlapping and interleaving). Each diagram plots the simulation time steps over the cache hit rate and the local memory access rate; note the different scales of the vertical axes.
6. Simulation Results
Various configurations of multiprocessors were simulated. The four diagrams in figure 2 vary the cache hit rate and the local memory access rate. The remote memory access rate results from these two rates. The vertical axis shows the simulation time steps needed for the executed benchmark program.
As can be seen easily, the stalling load/store unit performs worst, and the combined approach (bottom right) gives the best performance (note the different scales on the vertical axis). The interleaving and the overlapping techniques lie in between and are not in a strict order with respect to each other, because of the convex and concave curvature of the planes formed by the tops of the small bars. If we analyze the edges of the planes in the four diagrams, the following configurations of multiprocessor systems are represented:
a. The front edge, running from the lower rightmost side to the leftmost side, represents configurations without remote memory: a single-processor system without cache (the leftmost bar) or with cache and different cache hit rates.
b. The back edge, running from the uppermost bar in the middle of the diagram to the rightmost bar, removes the local memory from the simulations, representing cache-only DSM multiprocessors like the KSR machines of Kendall Square Research, or NUMA (non-uniform memory access) multiprocessor systems with caches and a single global memory.
c. The back edge, running from the uppermost bar in the middle of the diagram to the leftmost bar, represents configurations with local and remote memory but without cache, i.e. NUMA multiprocessors such as DSM multiprocessors without caches.
d. All bars not at an edge represent DSM multiprocessors with caches or shared memory multiprocessors with caches, local, and global memory. A realistic hit rate assumption is 60% cache hits, 10% local memory, and 30% remote memory accesses, easily found as a specific single bar in the diagrams. Starting from this bar, we vary:
   d1. the cache hit rate and the local memory access rate (remote memory access rate fixed at 30%),
   d2. the cache hit rate and the remote memory access rate (local memory access rate fixed at 10%), and
   d3. the local and the remote memory access rates (cache hit rate fixed at 60%).
The configurations a.-c. and d1.-d3. are shown in single diagrams (figures 3-8) and compared with the performance of a conventional processor, either with a stalling or with an overlapping load/store unit implementation.
a. Removing the remote memory, the diagram (figure 3) illustrates the memory hierarchy behavior of a system with a single processor. The multithreaded processor with all four load/store unit implementations performs better than the conventional stalling processor. But the conventional processor with overlapping load/store and execution instructions performs better than the simple multithreading approach for systems without cache or with low cache hit rates. The realistic cache hit rates range from 60% to 95%. In this region the combined approach performs best, much better than the conventional processors.
The strong inclination of the graphs is mainly caused by the influence of the cache hit rate on the overall performance. The multithreaded processor bridges only a part of the memory latency, because the cycle time is equal to the memory access time. If the cycle time is smaller than the access time, the multithreading approaches perform even better than shown in figure 3.
b. Removing the local memory from the nodes, we represent a cache-only DSM multiprocessor or a NUMA multiprocessor with caches and remote memory (figure 4). The three more complex load/store unit approaches perform much better than the two conventional processors and the simple multithreading approach, especially when the cache hit rate is low or the cache is missing. The Sparcle/MSparc approach performs best for cache hit rates up to 60%. The angle between the upper three and the lower four graphs in figure 4 is caused by the ratio of cycle time to access time.
The realistic hit rates for caches in multiprocessor systems range from 10% to 60%. The graph of the conventional processor with overlapping is typical for a cache-only multiprocessor system like the KSR, which does not use a multithreaded processor to bridge memory latencies. Here, the advantage of a multithreaded processor is overwhelming.
In the range of realistic cache hit rates from 10% to 60%, the three multithreading techniques perform as well as conventional processors with a very good cache hit rate of 80% or higher. Even multithreading without cache is as good as a conventional processor with a cache hit rate of 65%. Thus, multithreading can replace expensive cache memories.

Figure 3: no remote memory (configuration a.). Simulation time steps over the cache hit rate (100% to 0%) for the Rhamma load/store unit variants (stalling, interleaving, overlapping, combined) and the conventional processor with stalling and with overlapping.
Figure 4: no local memory (configuration b.). Simulation time steps over the cache hit rate for the Rhamma variants, the conventional processor with stalling and with overlapping, and the Sparcle-like processor; misses go to the remote memory.

Citations
Journal ArticleDOI
TL;DR: This survey paper explains and classifies the explicit multithreaded techniques in research and in commercial microprocessors.
Abstract: Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already into production in the areas of high-performance microprocessors, media, and network processors. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads of control are often stored in separate on-chip register sets. Unused instruction slots, which arise from latencies during the pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor. The execution units are multiplexed between the thread contexts that are loaded in the register sets. Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor to utilize a larger part of the issue bandwidth by issuing instructions from different threads simultaneously. Explicit multithreaded processors are multithreaded processors that apply processes or operating system threads in their hardware thread slots. These processors optimize the throughput of multiprogramming workloads rather than single-thread performance. We distinguish these processors from implicit multithreaded processors that utilize thread-level speculation by speculatively executing compiler- or machine-generated threads of control that are part of a single sequential program. This survey paper explains and classifies the explicit multithreading techniques in research and in commercial microprocessors.

319 citations

Journal ArticleDOI
TL;DR: The paper shows that unification of von Neumann and dataflow models is possible and preferred to treating them as two unrelated, orthogonal computing paradigms.
Abstract: The paper presents an overview of the parallel computing models, architectures, and research projects that are based on asynchronous instruction scheduling. It starts with pure dataflow computing models and presents an historical development of several ideas (i.e. single-token-per-arc dataflow, tagged-token dataflow, explicit token store, threaded dataflow, large-grain dataflow, RISC dataflow, cycle-by-cycle interleaved multithreading, block interleaved multithreading, simultaneous multithreading) that resulted in modern multithreaded superscalar processors. The paper shows that unification of von Neumann and dataflow models is possible and preferred to treating them as two unrelated, orthogonal computing paradigms. Today's dataflow research incorporates more explicit notions of state into the architecture, and von Neumann models using many dataflow techniques to improve the latency hiding aspects of modern multithreaded systems.

86 citations

Journal ArticleDOI
TL;DR: The results show that SDF architecture can outperform the superscalar and scales better with the number of functional units and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.
Abstract: In this paper, the scheduled dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power expenditures resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful execution paradigm that is based on dataflow and multithreading. A program is partitioned into nonblocking execution threads. In addition, all memory accesses are decoupled from the thread's execution. Data is preloaded into the thread's context (registers) and all results are poststored after the completion of the thread's execution. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread, as well as eliminate unnecessary dependencies among instructions. We have compared the execution cycles required for programs on SDF with the execution cycles required by programs on SimpleScalar (a superscalar simulator) by considering the essential aspects of these architectures in order to have a fair comparison. The results show that SDF architecture can outperform the superscalar. SDF performance scales better with the number of functional units and allows for a good exploitation of Thread Level Parallelism (TLP) and available chip area.

83 citations

Proceedings ArticleDOI
09 Dec 2006
TL;DR: This paper defines the fairness metric using the ratio of the individual threads' speedups, and shows how it can be enforced in switch on event multithreading, and analyze the impact of the fairness enforcement mechanism on throughput.
Abstract: The need to reduce power and complexity will increase the interest in Switch on Event multithreading (coarse grained multithreading). Switch on Event multithreading is a low power and low complexity mechanism to improve processor throughput by switching threads on execution stalls. Fairness may, however, become a problem in a multithreaded processor. Unless fairness is properly handled, some threads may starve while others consume all of the processor cycles. Heuristics that were devised in order to improve fairness in Simultaneous Multithreading are not applicable to Switch on Event multithreading. This paper defines the fairness metric using the ratio of the individual threads' speedups, and shows how it can be enforced in Switch on Event multithreading. Fairness is controlled by forcing additional thread switch points. These switch points are determined dynamically by runtime estimation of the single threaded performance of each of the individual threads. We analyze the impact of the fairness enforcement mechanism on throughput. We present simulation results of the performance of Switch on Event multithreading. Switch on Event multithreading achieves an average speedup over single thread of 24% when no fairness is enforced. In this case, over a third of our runs achieved poor fairness in which one thread ran extremely slowly (10 to 100 times slower than its single thread performance) while the other thread's performance was hardly affected. By using the proposed mechanism we can guarantee fairness of 1/4, 1/2 and 1 for a small performance loss of 2.2%, 3.7% and 7.2% respectively.

56 citations


Cites background from "A multithreaded processor designed ..."

  • ...Several research projects studied SOE multithreading [1, 2, 17, 32], but none of them dealt with the fairness problem....


Proceedings ArticleDOI
12 Oct 1999
TL;DR: A multithreaded Java microcontroller with a new hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements and evaluates the basic architectural attributes using real- time event parameters of an autonomous guided vehicle.
Abstract: We propose a multithreaded Java microcontroller (called Komodo microcontroller) with a new hardware event handling mechanism that allows handling of simultaneous overlapping events with hard real-time requirements. Real-time Java threads are used as interrupt service threads (ISTs) instead of interrupt service routines (ISRs). Our proposed Komodo microcontroller supports multiple ISTs with zero-cycle context switching overhead. We evaluate the basic architectural attributes using real-time event parameters of an autonomous guided vehicle. When calculating the maximum vehicle speed without violating the real-time constraints, ISTs dominate ISRs by a speed increase of 28%.

44 citations

References
Book
01 Dec 1989
TL;DR: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today.
Abstract: This best-selling title, considered for over a decade to be essential reading for every serious student and practitioner of computer design, has been updated throughout to address the most important trends facing computer designers today. In this edition, the authors bring their trademark method of quantitative analysis not only to high-performance desktop machine design, but also to the design of embedded and server systems. They have illustrated their principles with designs from all three of these domains, including examples from consumer electronics, multimedia and Web technologies, and high-performance computing.

11,671 citations

Proceedings ArticleDOI
01 May 1995
TL;DR: Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multi-threading, and is an attractive alternative to single-chip multiprocessors.
Abstract: This paper examines simultaneous multithreading, a technique permitting several independent threads to issue instructions to a superscalar's multiple functional units in a single cycle. We present several models of simultaneous multithreading and compare them with alternative organizations: a wide superscalar, a fine-grain multithreaded processor, and single-chip, multiple-issue multiprocessing architectures. Our results show that both (single-threaded) superscalar and fine-grain multithreaded architectures are limited in their ability to utilize the resources of a wide-issue processor. Simultaneous multithreading has the potential to achieve 4 times the throughput of a superscalar, and double that of fine-grain multithreading. We evaluate several cache configurations made possible by this type of organization and evaluate tradeoffs between them. We also show that simultaneous multithreading is an attractive alternative to single-chip multiprocessors; simultaneous multithreaded processors with a variety of organizations outperform corresponding conventional multiprocessors with similar execution resources. While simultaneous multithreading has excellent potential to increase processor utilization, it can add substantial complexity to the design. We examine many of these complexities and evaluate alternative organizations in the design space.

1,713 citations

Proceedings ArticleDOI
01 Jun 1990
TL;DR: The Tera architecture was designed with several goals in mind; it needed to be suitable for very high speed implementations, i.
Abstract: The Tera architecture was designed with several major goals in mind. First, it needed to be suitable for very high speed implementations, i.e., admit a short clock period and be scalable to many processors. This goal will be achieved; a maximum configuration of the first implementation of the architecture will have 256 processors, 512 memory units, 256 I/O cache units, 256 I/O processors, and 4096 interconnection network nodes and a clock period less than 3 nanoseconds. The abstract architecture is scalable essentially without limit (although a particular implementation is not, of course). The only requirement is that the number of instruction streams increase more rapidly than the number of physical processors. Although this means that speedup is sublinear in the number of instruction streams, it can still increase linearly with the number of physical processors. The price/performance ratio of the system is unmatched, and puts Tera's high performance within economic reach. Second, it was important that the architecture be applicable to a wide spectrum of problems. Programs that do not vectorize well, perhaps because of a preponderance of scalar operations or too-frequent conditional branches, will execute efficiently as long as there is sufficient parallelism to keep the processors busy. Virtually any parallelism available in the total computational workload can be turned into speed, from operation level parallelism within program basic blocks to multiuser time- and space-sharing. The architecture

797 citations

Proceedings ArticleDOI
01 May 1995
TL;DR: Analysis of the MIT Alewife machine shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives.
Abstract: Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms --- including block multithreading and prefetching --- mask unavoidable delays due to communication.Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help to analyze the behavior of the system. Analysis shows that integrating message passing with shared memory enables a cost-efficient solution to the cache coherence problem and provides a rich set of programming primitives. Block multithreading and prefetching improve performance by up to 25% individually, and 35% together. Finally, language constructs that allow programmers to express fine-grain synchronization can improve performance by over a factor of two.

416 citations


"A multithreaded processor designed ..." refers methods in this paper

  • ...the block multithreaded processors Sparcle of the MIT Alewife machine [4], MSparc [5] and MTA [6], the multithreaded superscalar processors developed at the Media Research Laboratory of Matsushita Electric Industrial Co. [7], at the University of California, Irvine [8], at the University of Karlsruhe [9], and the...



Frequently Asked Questions (1)
Q1. What contributions have the authors mentioned in the paper "A multithreaded processor designed for distributed shared memory systems" ?

The authors present a multithreaded processor, called Rhamma, that uses a fast context switch to bridge latencies caused by memory accesses or synchronization operations. They present two techniques to reduce the context switch cost to at most one processor cycle (coding the context switch explicitly in the opcode and using a context switch buffer), and they evaluate four implementation alternatives of the load/store unit to increase processor performance.