
Showing papers on "Latency (engineering) published in 2001"


Patent
12 Apr 2001
TL;DR: In this article, the authors propose a sub-frame structure for variable high data rates with the flexibility to efficiently carry lower data rate, lower latency frames using sub-framing, where each voice customer is allotted one or more frames or portions of frames within the superframe, called sub-frames, as is needed to deliver the lower latency voice communication.
Abstract: A frame structure that is ordinarily optimized for providing variable high data rates also includes the flexibility to efficiently carry lower data rate, lower latency frames using sub-framing. Superframes, each comprised of a predetermined number of frames, carry voice and data communications at one or more variable data rates. The size of a superframe is limited, such as by the delay tolerance for voice transmission, typically 20 ms. Each voice customer is allotted one or more frames or portions of frames within the superframe, called sub-frames, as is needed to deliver the lower data rate, low latency voice communication. The allocation for the voice customers is not fixed, but varies as the data rate varies over time. Any bits in a frame that are not needed to carry voice communication are assigned to carry data having compatible data rate requirements. Additionally, the sub-framing concept may be extended to include ATM cells.

106 citations
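The superframe allocation described above can be sketched in a few lines. This is an illustrative model only, not the patented method: the frame size, superframe length, and greedy placement are all assumptions made for the demo; the point is that voice gets just the bits it needs and every leftover bit is handed to data.

```python
# Illustrative sketch (not the patented algorithm): allocate voice sub-frames
# inside a superframe, then assign leftover bits to data traffic.

FRAME_BITS = 1000           # assumed capacity of one frame at the current rate
FRAMES_PER_SUPERFRAME = 20  # assumed; superframe bounded by ~20 ms voice delay

def allocate(voice_demands_bits):
    """Greedily place each voice customer's bits into frames; return
    (per-frame voice bits, per-frame bits left over for data).
    Assumes total voice demand fits within the superframe."""
    voice = [0] * FRAMES_PER_SUPERFRAME
    frame = 0
    for need in voice_demands_bits:
        while need > 0:
            room = FRAME_BITS - voice[frame]
            take = min(room, need)
            voice[frame] += take
            need -= take
            if voice[frame] == FRAME_BITS:
                frame += 1  # current frame full, move to the next
    # every bit not carrying voice is available for data traffic
    data = [FRAME_BITS - v for v in voice]
    return voice, data

voice, data = allocate([300, 300, 700])
```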


Patent
28 Nov 2001
TL;DR: In this paper, a method for multiplexing compressed video data streams where the time for sending portions of a video frame are adjusted to reduce latency is described, where a compressed frame is broken into parts and a part is sent in an earlier frame time.
Abstract: A method for multiplexing compressed video data streams where the time for sending portions of a video frame are adjusted to reduce latency. If a compressed frame cannot be delivered in the appropriate frame time, due to bandwidth limitations, the frame is broken into parts and a part is sent in an earlier frame time. This method allows complete frames to be available at a receiver at the correct time. Accurate methods of deriving clock signals from the data stream are also described.

45 citations
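The frame-splitting idea above lends itself to a small scheduling sketch. This is our own illustrative model, not the patent's method: it assumes a fixed per-frame-time bit budget and frames known in advance, and it sends overflow bits of oversized frames in earlier frame times so that every frame is complete by its own slot.

```python
# Sketch: if a compressed frame exceeds the channel budget for its frame
# time, ship the overflow in earlier frame times (budget value is assumed).

BUDGET = 100  # bits deliverable per frame time

def schedule(frame_sizes):
    """Return sent[t] = list of (frame_index, bits) sent during frame time t.
    Frame i must be fully delivered by slot i, so we fill from slot i
    backwards toward slot 0."""
    n = len(frame_sizes)
    free = [BUDGET] * n
    sent = [[] for _ in range(n)]
    for i in range(n):
        remaining = frame_sizes[i]
        t = i
        while remaining > 0 and t >= 0:
            take = min(free[t], remaining)
            if take:
                sent[t].append((i, take))
                free[t] -= take
                remaining -= take
            t -= 1
        if remaining:
            raise ValueError("bandwidth insufficient even with early send")
    return sent

sent = schedule([50, 140, 70])  # middle frame overflows its slot
```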


Proceedings ArticleDOI
11 Jun 2001
TL;DR: An IEEE floating-point adder (FP-adder) design that accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE standard is presented.
Abstract: We present an IEEE floating-point adder (FP-adder) design. The adder accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE standard. The latency of the design for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. Moreover, the design can be easily partitioned into 2 stages consisting of 12 logic levels each, and hence, can be used with clock periods that allow for 12 logic levels between latches. The FP-adder design achieves low latency by combining various optimization techniques such as: a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on complement subtraction, compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation. A comparison of our design with other implementations suggests a reduction in the latency by at least two logic levels as well as simplified rounding implementation. A reduced precision version of our algorithm has been verified by exhaustive testing.

40 citations
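The near path of such a two-path adder handles subtractions of close operands, where massive cancellation forces a renormalization driven by a leading-zero count. A toy software illustration of that step (nothing like the paper's gate-level borrow-save circuits; the significand width is an assumption):

```python
# Toy model of the "near path": subtract close significands at the same
# exponent, count leading zeros, and shift left to renormalize.

WIDTH = 24  # single-precision-style significand width (assumed for the demo)

def leading_zeros(x):
    """Leading zeros of x viewed as a WIDTH-bit field."""
    return WIDTH - x.bit_length() if x else WIDTH

def near_path(sig_a, sig_b):
    """Return (renormalized significand, exponent adjustment) after a
    cancellation-prone subtraction of same-exponent significands."""
    diff = abs(sig_a - sig_b)
    if diff == 0:
        return 0, 0
    shift = leading_zeros(diff)
    return diff << shift, -shift
```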


01 Jan 2001
TL;DR: The IEEE floating-point adder (FP-adder) as mentioned in this paper achieves a low latency by combining various optimization techniques such as a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction and compound adders.
Abstract: We present an IEEE floating-point adder (FP-adder) design. The adder accepts normalized numbers, supports all four IEEE rounding modes, and outputs the correctly normalized rounded sum/difference in the format required by the IEEE standard. The latency of the design for double precision is roughly 24 logic levels, not including delays of latches between pipeline stages. Moreover, the design can be easily partitioned into 2 stages consisting of 12 logic levels each, and hence, can be used with clock periods that allow for 12 logic levels between latches. The FP-adder design achieves a low latency by combining various optimization techniques such as: a non-standard separation into two paths, a simple rounding algorithm, unifying rounding cases for addition and subtraction, sign-magnitude computation of a difference based on one's complement subtraction, compound adders, and fast circuits for approximate counting of leading zeros from borrow-save representation. A comparison of our design with other implementations suggests a reduction in the latency by at least two logic levels as well as simplified rounding implementation. A reduced precision version of our algorithm has been verified by exhaustive testing.

38 citations


Patent
19 Jan 2001
TL;DR: In this paper, the authors propose a sub-frame structure for variable high data rates with the flexibility to efficiently carry lower data rate, lower latency frames using sub-framing, where each voice customer is allotted one or more frames or portions of frames within the superframe, called sub-frames, as is needed to deliver the lower latency voice communication.
Abstract: A frame structure that is ordinarily optimized for providing variable high data rates also includes the flexibility to efficiently carry lower data rate, lower latency frames using sub-framing. Superframes, each comprised of a predetermined number of frames, carry voice and data communications at one or more variable data rates. The size of a superframe is limited, such as by the delay tolerance for voice transmission, typically 20 ms. Each voice customer is allotted one or more frames or portions of frames within the superframe, called sub-frames, as is needed to deliver the lower data rate, low latency voice communication. The allocation for the voice customers is not fixed, but varies as the data rate varies over time. Any bits in a frame that are not needed to carry voice communication are assigned to carry data having compatible data rate requirements. Additionally, the subframing concept may be extended to include ATM cells.

36 citations


Journal ArticleDOI
01 Jul 2001
TL;DR: Although it has a larger number of channels compared to the crossbar and the mesh, the SOME-Bus is much simpler and less expensive because it is free of complex routing, congestion and blocking.
Abstract: The performance of a multi-computer system based on the simultaneous optical multi-processor exchange bus (SOME-Bus) interconnection network is examined using queuing network models under the message-passing and distributed-shared-memory (DSM) paradigms. The SOME-Bus is a low latency, high bandwidth, fiber-optic interconnection network which directly links arbitrary pairs of processor nodes without contention. It contains a dedicated channel for the data output of each node, eliminating the need for global arbitration and providing bandwidth that scales directly with the number of nodes in the system. Each of N nodes has an array of receivers, with one receiver dedicated to each node output channel. No node is ever blocked from transmitting by another transmitter or due to contention for shared switching logic. The entire N-receiver array can be integrated on a single chip at a comparatively minor cost, resulting in O(N) complexity. By supporting multiple simultaneous broadcasts of messages, the SOME-Bus has much more functionality than a crossbar, allowing synchronization phases and cache consistency protocols to complete much faster. Simulation results are presented which validate the theoretical results and compare processor utilization in the SOME-Bus, the crossbar and the torus, with and without synchronization. Compared to these two networks, the SOME-Bus performance is least affected by large message communication times. Even in the presence of frequent synchronization, processor utilization remains practically unaffected while it drops in the other architectures. Although it has a larger number of channels compared to the crossbar and the mesh, the SOME-Bus is much simpler and less expensive because it is free of complex routing, congestion and blocking.

34 citations


Book
02 May 2001
TL;DR: This tutorial presents a systemic approach to high-speed networks, where the goal is to provide high bandwidth and low latency to distribute applications, and to deal with the high bandwidth-x-delay product that results from high- speed networking over long distances.
Abstract: Summary form only given. The tutorial presents a comprehensive introduction to all aspects of high-speed networking, based on the book by J.P.G. Sterbenz and J.D. Touch ("High-Speed Networking: A Systematic Approach to High-Bandwidth Low-Latency Communication", John Wiley, 2001). The target audience includes computer scientists and engineers who may have expertise in a narrow aspect of high-speed networking, but want to gain a broader understanding of all aspects of high-speed networking and the impact that their designs have on overall network performance. The tutorial is not about any particular protocols and standards, but is rather a systemic and systematic approach to the principles that guide the research and design of high-speed networks, protocols, and applications. The tutorial presents a set of fundamental axioms, some major topics and a set of design principles that are defined and applied to each of the topics. A set of design techniques are introduced and applied as appropriate.

29 citations


Patent
02 Jan 2001
TL;DR: In this article, a method and apparatus for the analysis and dissemination of data in an on-line trading environment is described, and a trading environment implemented on a server such that users have easy access to an electronic trading exchange via the Internet.
Abstract: A method and apparatus for the analysis and dissemination of data in an on-line trading environment is disclosed. The trading environment is implemented on a server such that users have easy access to an electronic trading exchange via the Internet. A user may set account preferences such that a table is generated and displayed for easy recognition of preferred, acceptable and unacceptable trading partners. In addition, the identity of all sellers may remain unknown to protect the integrity of the trading environment.

22 citations


Proceedings ArticleDOI
05 Feb 2001
TL;DR: A 20-data-channel transceiver with a control channel allows uncoded data transfer with 13 ns latency; a digital DLL with a ring-interpolator tracks phase with 20 ps resolution, and the effective full-duplex bandwidth reaches 10 GB/s.
Abstract: A 20-data-channel transceiver with a control channel allows uncoded data transfer with 13 ns latency. A digital DLL with a ring-interpolator tracks phase with 20 ps resolution. A pre-emphasis driver enables 2 Gb/s transmission per channel over a 7 m cable at 1.5 V. The effective full-duplex bandwidth reaches 10 GB/s.

17 citations


Patent
26 Apr 2001
TL;DR: In this article, a commit message is returned to a source processor that requests a memory access operation so as to indicate the apparent completion of the operation, and a multiple-level switch unit linking nodes that contain the processors.
Abstract: A multiple-processor system in which a commit message is returned to a source processor that requests a memory access operation so as to indicate the apparent completion of the operation includes a multiple-level switch unit linking nodes that contain the processors. The switch unit includes multiple input switches each of which receives messages from multiple nodes, and a set of output switches whose inputs are the outputs of the input switches and whose outputs are the inputs of the nodes. Each switch processes messages in the order in which they are received by the switch and each output switch follows the same rule as the other output switches.

16 citations


Patent
Wen-Hsiao Peng1, Yen-Kuang Chen1
26 Sep 2001
TL;DR: In this article, fractional parts of quantized video coefficients are used as enhancement layers when encoding a video stream, which allows the reuse of decoding components.
Abstract: Fractional parts of quantized video coefficients are used as enhancement layers when encoding a video stream. This use of the fractional parts allows the reuse of decoding components.

Proceedings ArticleDOI
06 May 2001
TL;DR: A new scheduling algorithm, differential effective service, is introduced; it outperforms existing schedulers with respect to effective packet delay, and simulation results are provided.
Abstract: Non-real-time (NRT) services will constitute a majority of the services provided by third generation systems. The primary QoS parameter for NRT traffic flows is traffic handling priority (THP), which determines the ratios of specified class performance measures. Good system performance is generally defined by low latency and low likelihood of packet loss. The most common measure for latency is average packet delay; however, we show that this measure does not truly reflect the user's perceived packet delay. Instead, we introduce effective packet delay, which measures delay from the user's perspective. Furthermore, we investigate the ability of existing priority and differential service scheduling algorithms to achieve the prescribed THPs for the above performance measures as well as the effect of varying channel bit rate on scheduler performance. Finally, we introduce a new scheduling algorithm, differential effective service, which outperforms existing schedulers with respect to effective packet delay, and provide simulation results.
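The contrast between average and user-perceived delay can be made concrete with a toy metric. The paper's exact definition of effective packet delay is not reproduced here; the sketch below uses burst-completion delay as one plausible user-centric proxy: the user notices when the whole burst has arrived, not the mean per-packet delay.

```python
# Illustrative proxy only (not the paper's metric): contrast the per-packet
# average delay with a burst-completion view of perceived delay.

def average_delay(delays):
    """Plain per-packet average delay."""
    return sum(delays) / len(delays)

def burst_completion_delay(bursts):
    """bursts: list of per-packet delay lists, one per user burst.
    Perceived delay of a burst = delay of its slowest packet."""
    per_burst = [max(b) for b in bursts]
    return sum(per_burst) / len(per_burst)

# One straggler packet dominates what the user of burst 1 experiences,
# yet barely moves the per-packet average.
bursts = [[1, 1, 9], [1, 1, 1]]
flat = [d for b in bursts for d in b]
```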

28 May 2001
TL;DR: The issue of VLSI design of low latency/low power finite field multipliers is addressed and methods from logic structure, circuit design and physical mapping aspects are presented and an irregular balanced-tree parallel multiplier is proposed.
Abstract: The issue of VLSI design of low latency/low power finite field multipliers is addressed and methods from logic structure, circuit design and physical mapping aspects are presented. With the proposed architecture and physical mapping, an irregular balanced-tree parallel multiplier can be implemented as easily as a regular multiplier. The custom VLSI implementations of these multipliers over GF(2/sup m/) show that the irregular multiplier has 53% smaller delay and 58% less power consumption than a regular multiplier.
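Multiplication in GF(2^m), which these designs implement, is carry-free: polynomial multiplication over GF(2) followed by reduction modulo an irreducible polynomial. A bit-serial software model, using GF(2^8) with the AES polynomial as a concrete, commonly used example (the hardware in the paper is a parallel balanced tree, not this serial loop):

```python
# Bit-serial GF(2^8) multiply: shift-and-XOR (carry-free), reducing by the
# irreducible polynomial whenever the degree reaches 8.

IRRED = 0x11B  # x^8 + x^4 + x^3 + x + 1 (the AES field polynomial)
M = 8

def gf_mul(a, b):
    """Multiply a, b in GF(2^M)."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # carry-free "add" is XOR
        b >>= 1
        a <<= 1
        if a >> M:       # degree reached M: reduce modulo IRRED
            a ^= IRRED
    return result
```

Note the absence of carries: that is what makes balanced XOR trees, like the irregular tree in the paper, a natural hardware realization.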

Proceedings ArticleDOI
01 Jan 2001
TL;DR: Some recent work into the potential of utilizing structures within such memory macros as cache substitutes, and under what conditions power savings may result are summarized.
Abstract: The new technology of Processing-In-Memory now allows relatively large DRAM memory macros to be positioned on the same die with processing logic. Despite the high bandwidth and low latency possible with such macros, more of both is always better. Classical techniques such as caching are typically used for such performance gains, but at the cost of high power. The paper summarizes some recent work into the potential of utilizing structures within such memory macros as cache substitutes, and under what conditions power savings may result.

Proceedings ArticleDOI
11 Jun 2001
TL;DR: Numerical results demonstrate that the proposed FTS-ACM scheme achieves a comparable bandwidth efficiency gain to adaptive BICM, which is particularly attractive for low latency transmission applications.
Abstract: In wireless systems supporting slowly moving users, adaptive trellis-coded modulation (TCM) schemes have demonstrated large bandwidth efficiency gains over their nonadaptive counterparts. In systems with highly mobile users, the adaptive bit-interleaved coded modulation (BICM) achieves a moderate bandwidth efficiency gain over previously proposed adaptive schemes and nonadaptive schemes with similar complexity. However, adaptive BICM requires a bit interleaver, which results in long latency. In this paper, adaptive coded modulation (ACM) schemes which do not employ interleaving and do not use uncoded bits are considered for time-varying channels. Two such ACM schemes are proposed. One of the ACM schemes uses a forward trellis search algorithm (FTS) to adapt to the current channel fading. Numerical results demonstrate that the proposed FTS-ACM scheme achieves a comparable bandwidth efficiency gain to adaptive BICM. FTS-ACM is particularly attractive for low latency transmission applications.

Patent
Thomas Fuehrer1, Bernd Müller1
28 Dec 2001
TL;DR: In this paper, a method and a communication system for exchanging data between at least two users interconnected over a bus system are described, where the data is contained in messages which are transmitted by the users over the bus system.
Abstract: A method and a communication system for exchanging data between at least two users interconnected over a bus system are described. The data is contained in messages which are transmitted by the users over the bus system. To improve data exchange among users so that in the normal case there is a high probability that messages can be transmitted with a low latency, while in the worst case a finite maximum latency can still be guaranteed, the data is transmitted over the bus system by an event-oriented method as long as a preselectable latency period, elapsing between a transmission request by a user and the actual transmission operation of the user, can be guaranteed for each message to be transmitted as a function of the utilization of capacity of the bus system; otherwise the data is transmitted over the bus system by a deterministic method.
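The switching rule above can be caricatured in a few lines. This is a deliberately simplified model, not the patented protocol: it assumes the worst-case wait under event-driven arbitration is one full service time per pending message, and it falls back to the deterministic method once the guaranteed bound would be violated.

```python
# Simplified model of the hybrid rule: stay event-driven while the
# worst-case queueing wait fits under the guaranteed latency bound.

BOUND_MS = 10.0  # preselectable guaranteed latency (assumed value)

def choose_mode(pending_msgs, ms_per_msg):
    """Worst case under event-driven arbitration: a new request may wait
    behind every message currently pending on the bus."""
    worst_case_wait = pending_msgs * ms_per_msg
    return "event" if worst_case_wait <= BOUND_MS else "deterministic"
```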

Proceedings ArticleDOI
06 May 2001
TL;DR: An improved parallel lattice structure to implement IFFT module in the transmitter of Discrete Multitone (DMT) system is proposed and a new pre-processing scheme is added to previous work, exploiting the symmetric/anti-symmetric properties of the input symbols.
Abstract: In this paper, we propose an improved parallel lattice structure to implement IFFT module in the transmitter of Discrete Multitone (DMT) system. By exploiting the symmetric/anti-symmetric properties of the input symbols, we add a new pre-processing scheme to previous work. By doing so, the iteration number can be halved from 2N to N (N is 256 in DMT). As a result, the clock rate of the IFFT lattice module can be lowered under the same input symbol rate. In addition, the internal registers are also reduced from 4N-4 to 2N-2, and multiplexers in post-processing circuits are all eliminated compared with the previous design. Hence, the proposed method can save more power consumption and further reduce the hardware complexity of the IFFT module. The proposed architecture is regular, modular, and free of global routing. Thus, it is very suitable for VLSI implementation.
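The pre-processing exploits the symmetry DMT imposes on the IFFT input: the subcarrier vector is Hermitian-symmetric, so the time-domain output is purely real and half of a general complex IFFT's work is redundant. A tiny direct-IDFT demonstration of that property (not the paper's lattice structure):

```python
# Direct IDFT on a Hermitian-symmetric spectrum: the output is real, which
# is the redundancy the paper's pre-processing exploits.

import cmath

def idft(X):
    """Naive inverse DFT (O(N^2)), adequate for a demonstration."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Hermitian symmetry: X[N-k] = conj(X[k]); X[0] (and X[N/2]) real.
X = [0, 1 + 2j, 3, 1 - 2j]
x = idft(X)  # imaginary parts vanish (up to rounding)
```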

Patent
12 Mar 2001
TL;DR: In this paper, an enhanced timing loop is presented for a partial-response maximum-likelihood (PRML) data channel in a direct access storage device (DASD).
Abstract: Methods and apparatus for enhanced timing loop are provided for a partial-response maximum-likelihood (PRML) data channel in a direct access storage device (DASD). An acquisition timing circuit for generating an acquisition timing signal includes a plurality of compare functions for receiving and comparing consecutive input signal samples on an interleave with a threshold value. The acquisition timing circuit includes a majority rule voting function coupled to the plurality of compare functions for selecting a timing interleave. Tracking timing circuitry for generating a timing error signal during a read operation includes a channel data detector. The channel data detector receives disk signal input samples and includes a multiple-state path memory. The tracking timing circuit includes a low latency detector receiving disk signal input samples. A selector function is coupled to an output of the low latency detector and is coupled to the multiple-state path memory for selecting a state. The selector function utilizes the low latency detector output and selects the state of the path memory. The selector function provides a low latency output corresponding to the selected state. The low latency output is used for generating the timing error signal during a read operation.

Proceedings ArticleDOI
01 Feb 2001
TL;DR: This work devised an implementation scheme for on-line algorithms that dynamically relocate a file in the centre of gravity of the set of users that are more frequently accessing it, and measured the algorithms performance in real settings.
Abstract: The allocation of shared resources in a distributed system is a key aspect to achieve both low latency in accessing the resource and low bandwidth consumption. When the set of users accessing a resource dynamically changes, the allocation policy should adapt the resource placement over time. In the literature, on-line algorithms have been proposed that dynamically relocate a file in the centre of gravity of the set of users that are more frequently accessing it. However, so far those algorithms have not been implemented. They must be adapted to work in real systems, and their interactions must be investigated with the existing network protocols and applications. In this work we study the behaviours of some of those algorithms in real environments. To this purpose, we devised an implementation scheme, and we measured the algorithms' performance in real settings.

Proceedings ArticleDOI
08 Oct 2001
TL;DR: This paper proposes two deadlock-free schemes that allow traffic through the network while the reconfiguration is being performed and analyzes the impact of network size and load on their behavior.
Abstract: Switched point-to-point interconnection networks provide the high bandwidth and low latency required by current distributed applications. When the topology changes, a reconfiguration of the routing tables is performed to maintain network connectivity. In order to prevent deadlock, traditional reconfiguration schemes discard application traffic during the reconfiguration process. The consequence is that the network cannot provide the bandwidth demanded by user applications. In order to solve this problem, we proposed two deadlock-free schemes that allow traffic through the network while the reconfiguration is being performed. By using these schemes, the network is able to fulfill the applications' requirements. In this paper, we evaluate these traditional and novel reconfiguration schemes. In particular, we analyze the impact of network size and load on their behavior. Application traffic has been modeled by means of a self-similar pattern. Simulation results clearly show the large performance degradation associated with the traditional approach and the significant benefits that can be obtained by using dynamic reconfiguration techniques.


08 Dec 2001
TL;DR: Stream-based and time-advance systems are compared in terms of the programming model, flow control, buffering, support for interaction, synchronization, modularity issues, and real-time requirements.
Abstract: A common model for multimedia systems is the stream, an abstraction representing the flow of continuous time-dependent data such as audio samples and video frames. The primary feature of streams is the ability to compose processes by making stream connections between them. An alternative time-advance model is related to discrete-event simulations. Data is computed in presentation order, but in advance of the actual presentation time. Timestamped, buffered data is subsequently output with low latency. The primary feature of time-advance systems is accurate output timing. Stream-based and time-advance systems are compared in terms of the programming model, flow control, buffering, support for interaction, synchronization, modularity issues, and real-time requirements.
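The time-advance model can be sketched as a small simulation. The sketch assumes a logical tick clock rather than real time: units are computed ahead of their presentation times, buffered with timestamps, and released only when the clock reaches each timestamp, which is what gives the model its accurate output timing.

```python
# Minimal time-advance simulation: produce ahead of time, buffer with
# timestamps, emit exactly at each unit's presentation tick.

import heapq

def run(computed, clock_ticks):
    """computed: list of (presentation_time, sample) pairs, possibly
    produced long before their times. Returns an output log of
    (tick, sample) in accurate presentation order."""
    buf = []
    out = []
    items = iter(sorted(computed))   # producer works in presentation order
    pending = next(items, None)
    for tick in clock_ticks:
        # producer runs ahead of the clock: buffer everything computed
        while pending is not None:
            heapq.heappush(buf, pending)
            pending = next(items, None)
        # consumer: release only samples whose presentation time has come
        while buf and buf[0][0] <= tick:
            _, sample = heapq.heappop(buf)
            out.append((tick, sample))
    return out
```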


Proceedings ArticleDOI
31 Jan 2001
TL;DR: The possibility of using AF service-capable networks to support the expedited forwarding (EF) service for the support of latency and jitter sensitive traffic is examined and the implication of a positive answer is that it will be possible to provide low delay and low delay jitter guarantees to video/audio-like traffic.
Abstract: Assured forwarding (AF) service allows the Internet service provider (ISP) to offer different levels of forwarding assurances to IP packets received from a customer. However, in a basic AF service-capable network, it is not possible to guarantee low latency and low jitter to IP packets. To support the transport of video/audio traffic with acceptable delay and jitter, the AF service is not sufficient. To rectify this, the Internet Engineering Task Force (IETF) has proposed the expedited forwarding (EF) service for the support of latency and jitter sensitive traffic. We examine the possibility of using AF service-capable networks to support the EF service. The implication of a positive answer to this question is that it will be possible to use AF service-capable networks to provide low delay and low delay jitter guarantees to video/audio-like traffic. We also analyze the impact of the EF service traffic on AF and best-effort traffic.

01 Jan 2001
TL;DR: This paper presents the design and analysis of a lightweight service for message-passing communication and parallel process coordination, based on the MPI specification, for unicast and collective communications.
Abstract: Rapid increases in the complexity of algorithms for real-time signal processing applications have led to performance requirements exceeding the capabilities of conventional digital signal processor (DSP) architectures. Many applications, such as autonomous sonar arrays, are distributed in nature and amenable to parallel computing on embedded systems constructed from multiple DSPs networked together. However, to realize the full potential of such applications, a lightweight service for message-passing communication and parallel process coordination is needed that is able to provide high throughput and low latency while minimizing processor and memory utilization. This paper presents the design and analysis of such a service, based on the MPI specification, for unicast and collective communications.

Book ChapterDOI
23 Sep 2001
TL;DR: The design and implementation of an interface to allow a simple integration of different (high speed) network interconnects for a message passing environment and the given results show PVM's capability of achieving low latency, high bandwidth using appropriate devices.
Abstract: This paper describes the design and implementation of an interface to allow a simple integration of different (high speed) network interconnects for a message passing environment. In particular, a common SAN layer has been developed and tested with an improved PVM version. The latter has been extended to provide a pluggable interface for low level SAN layer. In this context GM and PM for Myrinet and SISCI for Scalable Coherent Interface (SCI) have been implemented. With this pluggable interface an approach such as the channel device for MPICH has been made to easily integrate existing and new networking devices. Multi Protocols which handle several interconnects are supported as well. This allows for harnessing heterogeneous clusters with different high speed interconnects using the fastest available communication device when possible. The given results show PVM's capability of achieving low latency, high bandwidth using appropriate devices.

Patent
25 Sep 2001
TL;DR: In this article, a novel FIFO data structure in the form of a multi-dimensional FIFOs is presented, where data items are received at an input of an N-row-by-M-column FIFOS array of cells and transferred to an output, via a predetermined protocol of cell transfers, in the same order as received.
Abstract: A novel FIFO data structure in the form of a multi-dimensional FIFO. For a rectangular multi-dimensional FIFO, data items are received at an input of an N-row-by-M-column FIFO array of cells and transferred to an output, via a predetermined protocol of cell transfers, in the same order as received. Transfer rules or protocol are controlled by a control circuit implemented using asynchronous pipeline modules or a control circuit relying upon transition signaling.
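The order-preserving behavior of such a multi-dimensional FIFO can be modeled simply. This is a behavioral simplification, not the patent's asynchronous cell-transfer protocol: the N-row-by-M-column array is treated as M depth-N column FIFOs written and read round-robin, which restores global FIFO order at the output.

```python
# Behavioral model of an N-by-M FIFO: round-robin demux across columns on
# input, round-robin mux in the same order on output, so arrival order is
# preserved end to end.

from collections import deque

class Fifo2D:
    def __init__(self, rows, cols):
        self.cols = [deque(maxlen=rows) for _ in range(cols)]
        self.w = self.r = 0  # round-robin write/read column indices

    def push(self, item):
        col = self.cols[self.w]
        if len(col) == col.maxlen:
            raise OverflowError("FIFO full")
        col.append(item)
        self.w = (self.w + 1) % len(self.cols)

    def pop(self):
        item = self.cols[self.r].popleft()
        self.r = (self.r + 1) % len(self.cols)
        return item
```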

Proceedings ArticleDOI
31 Jan 2001
TL;DR: This paper reports on the new implementation of a DSM cluster computing system using the low-level application programming interface (LAPI) provided as part of IBM SP2 software and discusses and evaluates the new and previous DSE-LAPI implementations using application benchmarks with SPLASH-2 programs.
Abstract: Parallel applications on network-based computing are most sensitive to communication overhead. The performance of a cluster computing system largely depends on the bandwidth, latency, and communication software processing overhead of the communication subsystem. The currently available fiber optics and opto-electronic device technologies offer a solution to the bandwidth problem. However, achieving low latency remains a challenge and is considered by the research community as one critical issue to attain high-performance cluster computing. Different approaches to solve the problem have been proposed which deal with latency reduction and latency hiding. In this paper, we report on the new implementation of a DSM cluster computing system using the low-level application programming interface (LAPI) provided as part of IBM SP2 software. We discuss and evaluate the new (DSE-LAPI) and previous (DSE-TCP) implementations using application benchmarks with SPLASH-2 programs, i.e., FFT, radix, and LU. Likewise, we evaluate the scalability performance of the new implementation. Experimental results show promising performance of the new implementation and further demonstrate the relative merit of adapting LAPI on a DSM cluster.

Proceedings ArticleDOI
08 Oct 2001
TL;DR: A software VIA system is implemented on an 8-node SCI(Scalable Coherent Interface) network based PC cluster that provides both message-passing and shared-memory programming environments and shows a maximum bandwidth of 84MB/s and a minimum latency of 8§A on application level.
Abstract: The performance of a PC cluster system is limited by the use of traditional communication protocols, such as TCP/IP, because these protocols are accompanied by significant software overheads. To overcome the problem, systems based on a user-level message-passing interface without intervention of the kernel have been developed. VIA (Virtual Interface Architecture) is a widely adopted user-level message-passing interface which provides low latency and high bandwidth. In this paper, a software VIA system is implemented on an 8-node SCI (Scalable Coherent Interface) network based PC cluster. The system provides both message-passing and shared-memory programming environments and shows a maximum bandwidth of 84 MB/s and a minimum latency of 8 μs at the application level. An average speed-up of 6.1 was obtained in executing the NAS parallel benchmark programs on the 8-node SCI-based PC cluster. The system also shows better performance in comparison with other comparable cluster systems in carrying out the parallel benchmark programs.

01 Jan 2001
TL;DR: The authors summarize some recent work into the potential of utilizing structures within such memory macros as cache substitutes, and under what conditions power savings may result.
Abstract: The new technology of Processing-In-Memory now allows relatively large DRAM memory macros to be positioned on the same die with processing logic. Despite the high bandwidth and low latency possible with such macros, more of both is always better. Classical techniques such as caching are typically used for such performance gains, but at the cost of high power. This paper summarizes some recent work into the potential of utilizing structures within such memory macros as cache substitutes, and under what conditions power savings may result.