
Showing papers on "Latency (engineering)" published in 1994


Book
01 Jun 1994
TL;DR: A simple closed-form expression for contention in buffered, direct networks is derived and found to agree closely with simulations, and it is shown that a much larger fraction of the resulting performance improvement arises from the reduction in bandwidth requirements than from the decrease in latency.
Abstract: The latency of direct networks is modeled, taking into account both switch and wire delays. A simple closed-form expression for contention in buffered, direct networks is derived and found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints and under different workload parameters reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network is shown to have the lowest latency only when switch delays and network contention are ignored; three- or four-dimensional networks are favored otherwise. If communication locality exists, two-dimensional networks regain their advantage. Communication locality decreases both the base network latency and the network bandwidth requirements of applications. It is shown that a much larger fraction of the resulting performance improvement arises from the reduction in bandwidth requirements than from the decrease in latency.
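As a rough guide to the kind of model involved, the total latency of a buffered direct network is commonly decomposed into routing, serialization, and contention components; the sketch below is a generic form of that decomposition (symbols and structure are illustrative assumptions, not the paper's exact closed-form expression):

```latex
% Generic latency decomposition for a buffered direct network
% (illustrative form): H = average hop count, t_s = switch delay,
% t_w = wire delay, L = packet length, W = channel width,
% T_c(\rho) = contention delay as a function of channel utilization.
T \;=\; H\,(t_s + t_w) \;+\; \frac{L}{W} \;+\; T_c(\rho)
```

Communication locality reduces the average hop count H, which lowers both the routing term and the per-link bandwidth demand, consistent with the paper's conclusion.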

494 citations


Proceedings Article
06 Jun 1994
TL;DR: Current results, obtained from a trace-driven simulation, show that prefetching results in as much as a 280% improvement over LRU, especially for smaller caches, and can reduce cache size by up to 50%.
Abstract: Despite impressive advances in file system throughput resulting from technologies such as high-bandwidth networks and disk arrays, file system latency has not improved and in many cases has become worse. Consequently, file system I/O remains one of the major bottlenecks to operating system performance [10]. This paper investigates an automated predictive approach towards reducing file latency. Automatic Prefetching uses past file accesses to predict future file system requests. The objective is to provide data in advance of the request for the data, effectively masking access latencies. We have designed and implemented a system to measure the performance benefits of automatic prefetching. Our current results, obtained from a trace-driven simulation, show that prefetching results in as much as a 280% improvement over LRU, especially for smaller caches. Alternatively, prefetching can reduce cache size by up to 50%.
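The excerpt does not say which predictor is used; one classic scheme consistent with "uses past file accesses to predict future requests" is last-successor prediction. The C sketch below is a minimal illustration of that idea (file IDs, array sizes, and the on_access interface are hypothetical, not taken from the paper):

```c
/* Minimal last-successor file predictor: remember which file followed
 * each file last time, and predict that it will follow again. */
#include <stdio.h>

#define MAX_FILES 1024
#define NO_PRED   (-1)

static int successor[MAX_FILES];   /* successor[f] = file seen right after f */
static int last_file = NO_PRED;    /* most recently accessed file            */

/* Record an access and return a prefetch candidate, or NO_PRED. */
int on_access(int file)
{
    if (last_file != NO_PRED)
        successor[last_file] = file;   /* learn: file followed last_file   */
    last_file = file;
    return successor[file];            /* predict: what followed it before */
}

int main(void)
{
    int trace[] = {3, 7, 3, 7, 3};
    for (int i = 0; i < MAX_FILES; i++) successor[i] = NO_PRED;
    for (int i = 0; i < 5; i++) {
        int p = on_access(trace[i]);
        if (p != NO_PRED)
            printf("after file %d, prefetch file %d\n", trace[i], p);
    }
    return 0;
}
```

A real prefetcher would act on the returned candidate by issuing an asynchronous read into the file cache ahead of demand.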

402 citations


01 Jun 1994
TL;DR: This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code that attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and investigates the architectural support necessary to make prefetching effective.
Abstract: The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, the technology trends indicate that this gap between processor and memory speeds is likely to increase in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions cause latency to be an inherent problem. To cope with the memory latency problem, the basic solution that nearly all computer systems rely on is their cache hierarchy. While caches are useful, they are not a panacea. Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of either the programmer, runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits. This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two, both on uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
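For flavor, the example below shows what a compiler-inserted prefetch amounts to at the source level, written here by hand with GCC's __builtin_prefetch intrinsic (the dissertation's compiler emits prefetch instructions directly and, unlike this uniform sketch, issues them only for references predicted to miss):

```c
/* Prefetching a fixed number of iterations ahead so the memory access
 * for iteration i+PF_DIST overlaps the arithmetic of iteration i. */
#include <stddef.h>
#include <stdio.h>

#define PF_DIST 16   /* iterations ahead; tuned to miss latency vs. work */

double sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)                            /* stay in bounds */
            __builtin_prefetch(&a[i + PF_DIST], 0, 3);  /* read, keep in cache */
        s += a[i];
    }
    return s;
}

int main(void)
{
    static double a[1 << 16];
    for (size_t i = 0; i < sizeof a / sizeof *a; i++) a[i] = 1.0;
    printf("%.0f\n", sum(a, sizeof a / sizeof *a));
    return 0;
}
```

An actual compiler pass would additionally unroll or strip-mine the loop so that only one prefetch is issued per cache line, avoiding the redundant-prefetch overhead the dissertation is concerned with.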

262 citations



Journal ArticleDOI
TL;DR: In four experiments, subjects freely recalled previously studied items while a voice key and computer recorded each item’s recall latency relative to the onset of the recall period, suggesting that retrieval includes a brief normally distributed initiation stage followed by a longer exponentially distributed search stage.
Abstract: In four experiments, subjects freely recalled previously studied items while a voice key and computer recorded each item’s recall latency relative to the onset of the recall period. The measures of recall probability and mean recall latency were shown to be empirically independent, demonstrating that there exists no a priori relationship between the two. In all four experiments, latency distributions were fit well by the ex-Gaussian, suggesting that retrieval includes a brief normally distributed initiation stage followed by a longer exponentially distributed search stage. Further, the variation in mean latency stemmed from the variation in the duration of the search stage, not the initiation stage. Interresponse times (IRTs), the time elapsed between two successive item recalls, were analyzed as well. The growth of mean IRTs, plotted as a function of output position, was shown to be a simple function of the number of items not yet recalled. Finally, the mathematical nature of both free recall latency and IRT growth are shown to be consistent with a simple theoretical account of retrieval that depicts mean recall latency as a measure of the breadth of search.
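For reference, the ex-Gaussian referred to here is the convolution of a normal and an exponential distribution; its density, in standard form, is:

```latex
% Ex-Gaussian density: Normal(mu, sigma^2) initiation stage convolved
% with an Exponential(tau) search stage; Phi = standard normal CDF.
f(x \mid \mu, \sigma, \tau) \;=\;
  \frac{1}{\tau}\,
  \exp\!\left(\frac{\mu - x}{\tau} + \frac{\sigma^{2}}{2\tau^{2}}\right)
  \Phi\!\left(\frac{x - \mu}{\sigma} - \frac{\sigma}{\tau}\right)
```

Under the paper's interpretation, μ and σ describe the brief initiation stage and τ the exponential search stage, so mean recall latency decomposes as μ + τ, and the reported variation in mean latency is carried by τ.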

161 citations


Proceedings ArticleDOI
01 Jan 1994
TL;DR: It is found that a FLASHCACHE can reduce the power consumption of the storage subsystem by 20-40% and can improve overall response time by 30-70% when combined with an aggressive disk management policy.
Abstract: We examine the impact of using flash memory as a second-level file system buffer cache to reduce power consumption and file access latency on a mobile computer. We use trace-driven simulation to evaluate the impact of what we call a FLASHCACHE. We relate the power consumption and access latency of the storage sub-system to the characteristics of the FLASHCACHE: its size, the unit of erasure, and access costs. We find that a FLASHCACHE can reduce the power consumption of the storage subsystem by 20-40% and can improve overall response time by 30-70% when combined with an aggressive disk management policy. When combined with a more conservative policy, power is reduced by 40-70% while overall response time is improved by 20-60%. We also find that durability is not a problem; a 4 MB FLASHCACHE will last 33 years.
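The 33-year durability figure can be sanity-checked with simple arithmetic; the parameters below (10^6 erase cycles and roughly 4 KB/s of average write traffic) are assumptions chosen for illustration, not figures from the excerpt:

```latex
% Illustrative reconstruction with assumed parameters:
% E = 10^6 erase cycles, C = 4 MB of cache, W ~ 4 KB/s written on average.
\text{lifetime} \;=\; \frac{E \cdot C}{W}
  \;=\; \frac{10^{6} \times 4\,\text{MB}}{\approx 4\,\text{KB/s}}
  \;\approx\; 10^{9}\,\text{s} \;\approx\; 33\,\text{years}
```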

102 citations


Patent
30 Jun 1994
TL;DR: In this article, a system for controlling the transmission of cells from a network node over multiple Virtual Circuits (VCs) is disclosed, which performs traffic shaping, as required by connection-based systems such as Asynchronous Transfer Mode (ATM), so that the Quality of Service (QoS) parameters negotiated when the connection was established are not exceeded.
Abstract: A system for controlling the transmission of cells from a network node over multiple Virtual Circuits (VCs) is disclosed. The system performs traffic shaping, as required by connection-based systems such as Asynchronous Transfer Mode (ATM), for each VC connected with a network node, so that the Quality of Service (QoS) parameters negotiated when the connection was established are not exceeded. The system includes a process for scheduling the transmission of cells from the network node. The scheduling process periodically scans a table having entries corresponding to virtual circuits connected with the network node. During each scan of the table, the scheduler increments a sustainable rate accumulator field, a peak rate accumulator field, and a latency accumulator field of each table entry that corresponds with a virtual circuit that is open, and for which there is a cell ready to be transmitted. The scheduler further determines if the sustainable rate accumulator value is greater than or equal to a predetermined value and whether the peak rate accumulator value is greater than or equal to a predetermined value. If both conditions are true, then a cell may be transmitted on the virtual circuit corresponding with that table entry. The system further provides that transmissions are scheduled on virtual circuits having the greatest latency since previous transmissions.
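The scheduling loop described above is concrete enough to sketch. The C version below compresses one scan of the table into a function; field names, increment values, and the post-transmission accumulator charging are illustrative assumptions, not the patent's claim language:

```c
/* One scan of the VC table: bump accumulators for open VCs with a cell
 * queued, then transmit on the eligible VC that has waited longest. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool open;            /* VC connection is open             */
    bool cell_ready;      /* a cell is queued for this VC      */
    int  sustain_acc;     /* sustainable-rate accumulator      */
    int  peak_acc;        /* peak-rate accumulator             */
    int  latency_acc;     /* scans since last transmission     */
    int  sustain_inc;     /* per-scan increments derived from  */
    int  peak_inc;        /*   the VC's negotiated QoS rates   */
    int  sustain_thresh;
    int  peak_thresh;
} vc_entry;

/* Returns the VC index to transmit on, or -1 if none is eligible. */
int scan_table(vc_entry *t, int nvc)
{
    int pick = -1;
    for (int i = 0; i < nvc; i++) {
        if (!t[i].open || !t[i].cell_ready)
            continue;
        t[i].sustain_acc += t[i].sustain_inc;
        t[i].peak_acc    += t[i].peak_inc;
        t[i].latency_acc += 1;
        if (t[i].sustain_acc >= t[i].sustain_thresh &&
            t[i].peak_acc    >= t[i].peak_thresh &&
            (pick < 0 || t[i].latency_acc > t[pick].latency_acc))
            pick = i;                       /* greatest latency wins */
    }
    if (pick >= 0) {     /* charge the transmission against both rates */
        t[pick].sustain_acc -= t[pick].sustain_thresh;
        t[pick].peak_acc    -= t[pick].peak_thresh;
        t[pick].latency_acc  = 0;
    }
    return pick;
}

int main(void)
{
    vc_entry t[2] = {
        { true, true, 0, 0, 0, 1, 4, 10, 10 },   /* slow VC */
        { true, true, 0, 0, 0, 2, 8, 10, 10 }    /* fast VC */
    };
    for (int scan = 0; scan < 12; scan++) {
        int vc = scan_table(t, 2);
        if (vc >= 0) printf("scan %2d: transmit cell on VC %d\n", scan, vc);
    }
    return 0;
}
```

Because the per-scan increments encode each VC's negotiated sustainable and peak rates, a cell becomes eligible only as fast as its QoS contract allows, which is the traffic-shaping behavior the patent claims.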

99 citations


Journal ArticleDOI
TL;DR: The results support the hypotheses that the pattern of latency changes during activity are signatures for the modality in a given fiber; and that endogenous, activity-dependent processes of the axon contribute to adaptation and encoding in cutaneous sensory afferents.
Abstract: Cutaneous afferents exhibit changes in excitability after impulse activity that are correlated with functional modality but are independent of axonal diameter, as studied in 39 cold fibers and 51 nociceptors of the rat. Latency of conducted impulses was used to indicate changes in axonal excitability caused by electrical stimulation. Stimuli were applied both at fixed frequencies and at the time intervals of impulses previously recorded during response to natural stimulation. Latency increased following both these forms of electrical stimulation, as well as after natural stimulation of the receptive fields. The latency increase was correlated with the number of impulses and the frequency of the preceding discharge in all of 4 nociceptors and 13 cold fibers studied for this feature. Increase of latency by electrical or natural stimulation led to reduced responsiveness to natural stimulation. The magnitude and time course of latency changes were correlated with fiber modality. In 32 nociceptors the latency ...

70 citations


Journal ArticleDOI
TL;DR: It is shown that network latency forms a major obstacle to improving parallel computing performance and scalability, and an experimental metric, using network latency to measure and evaluate the scalability of parallel programs and architectures is presented.

63 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: It is shown how an optimized parallel operating system can be constructed such that the application processor's involvement in communication is kept to a minimum while the utilization of both processors is maximized.
Abstract: The paper demonstrates the advantages of having two processors in the node of a distributed memory architecture, one for computation and one for communication. The architecture of such a dual-processor node is discussed. To exploit fully the potential for parallel execution of computation threads and communication threads, a novel, compiler-optimized IPC mechanism allows for an unbuffered no-wait send and a prefetched receive without the danger of semantics violation. It is shown how an optimized parallel operating system can be constructed such that the application processor's involvement in communication is kept to a minimum while the utilization of both processors is maximized. The MANNA implementation results in an effective message start-up latency of only 1-4 microseconds. It is also shown how the dual-processor node is utilized to efficiently realize virtual shared memory.

50 citations


Journal ArticleDOI
TL;DR: A detailed simulation study of the latency effects in decoupled computers is undertaken and it is concluded that despite their capability to partially mask the effects of memory latency, decoupled architectures still need a data cache.
Abstract: Decoupled computer architectures partition the memory access and execute functions in a computer program and achieve high performance by exploiting the fine-grain parallelism between the two. These architectures make use of an access processor to perform the data fetch ahead of demand by the execute process and hence are often less sensitive to memory access delays than conventional architectures. Past performance studies of decoupled computers used memory systems that are interleaved or pipelined, and in those studies, latency effects were partially hidden due to interleaving. A detailed simulation study of the latency effects in decoupled computers is undertaken in this paper. Decoupled architecture performance is compared to single processors with caches. The memory latency sensitivity of cache based uniprocessors and decoupled systems is studied. Simulations are performed to determine the significance of data caches in a decoupled architecture. It is observed that decoupled architectures can reduce the peak memory bandwidth requirement, but not the total bandwidth, whereas data caches can reduce the total bandwidth by capturing locality. It may be concluded that despite their capability to partially mask the effects of memory latency, decoupled architectures still need a data cache.

Proceedings Article
17 Jan 1994
TL;DR: It is found that a low latency network controller has a significant impact on the overall latency of TCP; the latency impact of some widely discussed improvements to TCP, such as header prediction and combining the checksum calculation with data copying, is also characterized.
Abstract: In this paper we characterize the latency of the BSD 4.4 alpha implementation of TCP on an ATM network. Latency reduction is a difficult task, and careful analysis is the first step towards reduction. We investigate the impact of both the network controller and the protocol implementation on latency. We find that a low latency network controller has a significant impact on the overall latency of TCP. We also characterize the impact on latency of some widely discussed improvements to TCP, such as header prediction and the combination of the checksum calculation with data copying.
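One of the "widely discussed improvements" mentioned, combining the checksum calculation with data copying, is easy to illustrate: the data is touched once instead of twice. The sketch below is not the BSD 4.4 code; it assumes an even-length, word-aligned buffer and ignores byte-order details:

```c
/* Copy 16-bit words and accumulate the Internet checksum in the same
 * pass, halving the memory traffic of separate copy + checksum loops. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t nwords)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < nwords; i++) {
        dst[i] = src[i];              /* copy ...                       */
        sum += src[i];                /* ... and checksum, in one pass  */
    }
    while (sum >> 16)                 /* fold carries back into 16 bits */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;            /* one's-complement result        */
}

int main(void)
{
    uint16_t src[4] = {0x4500, 0x0034, 0x1c46, 0x4000}, dst[4];
    printf("checksum: 0x%04x\n", copy_and_checksum(dst, src, 4));
    return 0;
}
```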

Journal ArticleDOI
TL;DR: In these systems, through the application of contemporary molecular biological tools, descriptive features concerning the role of virus and cell in the establishment, maintenance and reactivation of the latent state are reasonably clear, but the mechanisms responsible are not so clear.

Proceedings ArticleDOI
19 Dec 1994
TL;DR: Experiments show that latency hiding techniques increase the feasibility of parallel computing in high-latency networks of workstations across the Internet as well as in multiprocessor systems.
Abstract: Very large problems with high resource requirements of both computation and communication could be tackled with large numbers of workstations. However for LAN-based networks, contention becomes a limiting factor whereas latency appears to limit communication for WAN-based networks, nominally the Internet. We describe a model to analyze the gain of communication latency hiding by overlapping computation and communication. This model illustrates the limitations and opportunities of communication latency hiding for improving speedup of parallel computations that can be structured appropriately. Experiments show that latency hiding techniques increase the feasibility of parallel computing in high-latency networks of workstations across the Internet as well as in multiprocessor systems.
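A minimal version of such an overlap model, assuming communication can be perfectly overlapped with computation (an idealization the paper's model refines), is:

```latex
% Without overlap, computation and communication serialize; with
% perfect overlap, the longer of the two dominates.
T_{\text{no overlap}} = T_{\text{comp}} + T_{\text{comm}}, \qquad
T_{\text{overlap}} = \max(T_{\text{comp}}, T_{\text{comm}})
\;\Rightarrow\;
\text{gain} = \frac{T_{\text{comp}} + T_{\text{comm}}}
                   {\max(T_{\text{comp}}, T_{\text{comm}})} \;\le\; 2
```

The bound makes the paper's point concrete: latency hiding pays off most when computation and communication are comparable in cost, exactly the regime of high-latency WAN-based networks.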


01 Jan 1994
TL;DR: PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to execution of just a few floating-point operations, and can be implemented at a cost of less than $50/PC, including cables.
Abstract: There are a lot of 386/486/Pentium-based personal computers (PCs) out there. They are affordable, reliable, and offer good performance. Thus, it is only natural to think of networking multiple PCs to create a high-performance parallel machine; the problem is that conventional networking systems cannot provide low-latency synchronization and communication. Low latency allows fine-grain parallelism; the longer the latency, the fewer the programs that can achieve good speedup through use of parallelism. Typical parallel machines constructed using PC networks (e.g., PVM software using Ethernet hardware) generally have latencies between 0.001 s and 0.1 s. Even the "best" commercially available parallel computers can do no better than a latency corresponding to the time to execute hundreds to thousands of floating-point operations. In contrast, PAPERS (Purdue's Adapter for Parallel Execution and Rapid Synchronization) provides a latency corresponding to the execution of just a few floating-point operations. Despite this, PAPERS can be implemented at a cost of less than $50/PC, including cables. This work was supported in part by the Office of Naval Research (ONR) under grant number N00014-91-J-4013 and by the National Science Foundation (NSF) under award number 9015696-CDA.

Proceedings ArticleDOI
18 May 1994
TL;DR: An integrated inter-process communication and scheduling scheme that can be used to minimize the end-to-end latency of multi-threaded applications and is implemented within the YARTOS kernel and is presently being ported to the Real-Time Mach kernel.
Abstract: The design of general-purpose operating systems imposes constraints on the way one can structure real-time applications. This paper addresses the problem of minimizing the end-to-end latency of applications that are structured as a set of cooperating (real-time) tasks. When applications are structured this way, the time required for data to progress from an input task to an output task is a function of the number of tasks that handle the data and the deadlines of the individual tasks. We present an integrated inter-process communication and scheduling scheme that can be used to minimize the end-to-end latency of multi-threaded applications. Our approach is to provide the scheduler with information on the inter-process communication interconnections between tasks and to use this information to guarantee an end-to-end latency to applications that is simply a function of the timing properties of the application and not its task structure. This scheme has been implemented within the YARTOS kernel and is presently being ported to the Real-Time Mach kernel.
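The structural issue can be stated as a simple bound; the formulation below is an illustrative reading of the abstract, not the paper's exact analysis:

```latex
% With n pipelined tasks, each with deadline d_i, a conventional
% scheduler can only bound end-to-end latency by the sum of the
% per-task deadlines, so latency grows with the task count n:
L_{\text{conventional}} \;\le\; \sum_{i=1}^{n} d_i
% The integrated IPC/scheduling scheme instead makes L a function of
% the application's timing properties alone, independent of n.
```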

Journal ArticleDOI
TL;DR: A novel prediction method for head motion using Grey System theory, with a 6D tracker attached to an HMD on a user's head in virtual reality applications, can reduce the latency by at least one half and reduce image jittering.
Abstract: In this paper we propose a novel prediction method for head motion using Grey System theory, where a 6D tracker is attached to an HMD on a user's head in virtual reality applications. Our prediction method using the Grey System Model can reduce the latency by at least one half and reduce image jittering. A system latency below 100 ms or even 50 ms can be achieved, whereas without prediction the latency is around 200 ms. Using 6 points for prediction with the Grey System Model gave the best results among the 2- to 10-point configurations we tried. We also propose a way to measure the latency in an HMD system precisely and conveniently. During the process, we have implemented four different prototypes respectively on a PC486, a SUN SparcStation10, an SGI IndigoR4000, and a high-performance computer image generator. The computation complexity of our prediction method is relatively low, and therefore the real-time requirement is easily met.
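The abstract does not spell out the predictor, but the standard Grey System building block is the GM(1,1) model. The C sketch below (sample data and function interface are hypothetical; the paper's 6-point head-motion predictor may differ in detail) fits GM(1,1) to a short window of tracker samples and predicts one step ahead:

```c
/* GM(1,1) one-step predictor: fit an exponential to the cumulative sum
 * of the samples, then difference it to predict the next raw value. */
#include <math.h>
#include <stdio.h>

double gm11_predict(const double *x0, int n)   /* n >= 4 samples */
{
    double x1 = x0[0];                 /* running cumulative sum        */
    double Sz = 0, Szz = 0, Szy = 0, Sy = 0;
    for (int k = 1; k < n; k++) {
        double z = x1 + 0.5 * x0[k];   /* background value z[k]         */
        x1 += x0[k];
        Sz += z;  Szz += z * z;  Szy += z * x0[k];  Sy += x0[k];
    }
    int m = n - 1;                     /* least-squares fit of a, b in  */
    double det = m * Szz - Sz * Sz;    /*   x0[k] + a*z[k] = b          */
    double a = (Sz * Sy - m * Szy) / det;    /* development coefficient */
    double b = (Szz * Sy - Sz * Szy) / det;  /* grey input              */
    double c = x0[0] - b / a;
    /* x1hat(k) = c*exp(-a*k) + b/a; next raw value is the difference   */
    return c * exp(-a * n) - c * exp(-a * (n - 1));
}

int main(void)
{
    double head_yaw[6] = {1.0, 1.4, 2.0, 2.8, 3.9, 5.5};  /* sample trace */
    printf("predicted next sample: %.3f\n", gm11_predict(head_yaw, 6));
    return 0;
}
```

Feeding the renderer the predicted pose instead of the last measured one is what lets the system mask a large fraction of the tracker-to-display latency.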

Proceedings ArticleDOI
28 Feb 1994
TL;DR: A state-space based approach is used which treats various algorithm transformations in an integrated fashion, and answers analytically whether it is possible to simultaneously meet any given combination of constraints on latency and throughput.
Abstract: We present algorithm transformations to simultaneously optimize for throughput and latency for the important case of linear time-invariant DSP systems. Although throughput alone can be arbitrarily improved using previously published techniques, none of them is effective when latency constraints are considered. We have used a state-space based approach which treats various algorithm transformations in an integrated fashion, and answers analytically whether it is possible to simultaneously meet any given combination of constraints on latency and throughput. The analytic approach is optimum and constructive in nature, and produces a complete implementation when feasibility conditions are fulfilled. We also present a sub-optimal but hardware-efficient heuristic approach. On all benchmarks the new approaches show much better results than previously published ones.
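For context, the state-space form in question is the standard one for linear time-invariant systems; the latency reading attached to it below is an illustrative gloss, not the paper's exact criterion:

```latex
% Standard state-space form of an LTI DSP system:
x(k+1) = A\,x(k) + B\,u(k), \qquad y(k) = C\,x(k) + D\,u(k)
% Transformations act jointly on (A, B, C, D): throughput relates to the
% longest combinational path per iteration, while a nonzero direct term
% D corresponds to a zero-delay input-to-output path, i.e., minimal
% latency. Treating both in one algebraic framework is what lets the
% feasibility of a (latency, throughput) pair be answered analytically.
```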

Book ChapterDOI
Virginia Lo
TL;DR: This chapter focuses on distributed shared memory systems supported primarily through software modifications to existing virtual memory management facilities, characterized by a larger unit of sharing, typically at the page level, and are designed for loosely coupled workstation networks.
Abstract: This chapter focuses on distributed shared memory (DSM) systems supported primarily through software modifications to existing virtual memory management facilities. These DSM systems are characterized by a larger unit of sharing, typically at the page level, and are designed for loosely coupled workstation networks. The chapter discusses some of the important issues regarding the design and implementation of DSM systems. It discusses techniques to reduce the latency attributable to memory coherence algorithms and methods to reduce the communication overhead incurred by the underlying network software and hardware. It categorizes the sources of latency reduction as latency stemming from the choice of coherence semantics and the design of the coherence algorithms, latency attributable to communication overhead generated by network software and hardware, and latency caused by unnecessary message-passing overhead due to large page sizes and false sharing. The chapter also discusses innovations in the design of DSM systems to minimize the impact of large page sizes and false sharing. It concludes with a summary of the lessons learned and the performance achievements attained during the past five years.

Journal ArticleDOI
TL;DR: Data is presented showing that the latency of the FFP can be shortened significantly if the subject is required to attend to the evoking auditory tone burst, while the amplitude of the FFP remains unaffected, indicating an attention-controlled influence on signal processing in the earliest parts of the auditory pathway.
Abstract: While effects of attention on late and middle latency components of the evoked potential have been demonstrated, similar effects on brain stem evoked potentials--in particular on the human frequency-following potential (FFP)--are controversial. The FFP is a response to tone bursts in the frequency range of human language (optimum approximately 350 Hz). It has a latency of approximately 6.3 ms and is probably generated at a site peripheral to the inferior colliculus. We present data showing that the latency of the FFP can be shortened significantly (45 microseconds) if the subject is required to attend to the evoking auditory tone burst, while the amplitude of the FFP remains unaffected. This indicates an attention-controlled influence on signal processing in the earliest parts of the auditory pathway.

Patent
03 Oct 1994
TL;DR: In this paper, the authors describe a method to report the response information in a flexible and high-performance manner in a multiprocessing system using two response windows, one for flow control and error status and the other for coherency reporting.
Abstract: A multiprocessing system utilizes a bus protocol having two response windows. The first response window is at a fixed latency from the transmission of a bus request and/or address, while the second response window, utilized for coherency reporting, is placed a configurable number of clock cycles after the bus request and address to allow for longer access, or snoop, times to perform a cache directory look-up within other bus devices. The first response window reports flow control and error status. Furthermore, a method is described that implements the reporting of response information in a flexible and high-performance manner.

Proceedings ArticleDOI
26 Apr 1994
TL;DR: The author identifies significant ways in which optical technology can boost network functionality and performance when key architectural and implementation design issues are considered.
Abstract: Communication complexity and latency are critical problems in multiprocessor systems. A significant portion of communication latency is associated with the interconnect network. Optics has many advantages for achieving low latency, scalable interprocessor communication. The author identifies significant ways in which optical technology can boost network functionality and performance when key architectural and implementation design issues are considered. A high bandwidth, reconfigurable optical interconnect capable of increased network throughput and optimal processor-memory connectivity can result from this approach.


Proceedings ArticleDOI
12 Jun 1994
TL;DR: Communication latency can be reduced by increasing bandwidth via a sender-based anticipation technique called Parallel Communication, which is expected to be especially useful in reducing latency in automated FTP access.
Abstract: Communication latency can be reduced by increasing bandwidth via a sender-based anticipation technique called Parallel Communication. The authors apply this method to anonymous FTP. Their analysis of log files indicates that latency can be reduced to 2 round-trip times, as small as 0.6 round-trip time per file, for a 7x increase in bandwidth. This technique applies to up to 95% of the FTP traffic. This method is expected to be especially useful in reducing latency in automated FTP access, such as in the World-Wide Web.

Patent
04 May 1994
TL;DR: In this article, a configurable network interface controller that provides automatic retransmission of collided Ethernet frames from a local RAM while observing two modes of operation: (1) retransmissions of as much of the frame as possible without violating latency requirements and (2) first returning to observation of the latency requirements.
Abstract: A configurable network interface controller that provides for the automatic retransmission of collided Ethernet frames from a local RAM while observing two modes of operation: (1) retransmission of as much of the frame as possible without violating latency requirements and (2) first guaranteeing the safe retransmission of the first 64 bytes and then returning to observation of the latency requirements.

Proceedings ArticleDOI
01 Apr 1994
TL;DR: The analyses and experiments show that the latency metric is an important method to effectively measure overheads inherent in the program and the architecture for parallel computing scalability evaluation.
Abstract: Network latency, the delay caused by communication between processors and memory modules over the network in a multiprocessor system, is a major source of degraded parallel computing performance. We first give an overview of an experimental metric which uses network latency to measure and evaluate the scalability of parallel programs and architectures. We put emphasis on the evaluation of latency sources and their measurement during program execution. We report experimental results of evaluating the scalability of several scientific computing algorithms on the KSR-1. In comparison, we also present preliminary experiments on the CM-5 multicomputer architecture. Our comparisons indicate that the CM-5 has less network latency effects and is a more scalable architecture than the KSR-1. The analyses and experiments show that the latency metric is an important method to effectively measure overheads inherent in the program and the architecture for parallel computing scalability evaluation.

Patent
24 May 1994
TL;DR: A latency error detection circuit including two cascaded latches receiving a clock signal from a measuring system upon the occurrence of an event and correspondingly asserting a bit to a processing system, and a circuit for clearing the first latch after the processing system acknowledges detecting the bit being asserted.
Abstract: A latency error detection circuit including two cascaded latches receiving a clock signal from a measuring system upon the occurrence of an event and correspondingly asserting a bit to a processing system, and a circuit for clearing the first latch after the processing system acknowledges detecting the bit being asserted. If the second latch is clocked before the first latch is cleared, the second latch sets an error bit indicating a latency error condition. The processor system monitors the error bit to determine whether a latency error has occurred.