
Showing papers on "Speedup published in 1999"


Proceedings ArticleDOI
01 May 1999
TL;DR: A novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations, which enables fast, robust compilers, supports forward compatibility, and virtualizes configurations, thus removing the fixed size constraint present in other fabrics.
Abstract: Future computing workloads will emphasize an architecture's ability to perform relatively simple calculations on massive quantities of mixed-width data. This paper describes a novel reconfigurable fabric architecture, PipeRench, optimized to accelerate these types of computations. PipeRench enables fast, robust compilers, supports forward compatibility, and virtualizes configurations, thus removing the fixed size constraint present in other fabrics. For the first time we explore how the bit-width of processing elements affects performance and show how the PipeRench architecture has been optimized to balance the needs of the compiler against the realities of silicon. Finally, we demonstrate extreme performance speedup on certain computing kernels (up to 190x versus a modern RISC processor), and analyze how this acceleration translates to application speedup.

478 citations


Book ChapterDOI
15 Aug 1999
TL;DR: To cluster increasingly massive data sets that are common today in data and text mining, a parallel implementation of the k-means clustering algorithm based on the message passing model is proposed; the speedup and the scaleup of the algorithm are shown analytically to approach the optimal as the number of data points increases.
Abstract: To cluster increasingly massive data sets that are common today in data and text mining, we propose a parallel implementation of the k-means clustering algorithm based on the message passing model. The proposed algorithm exploits the inherent data-parallelism in the k-means algorithm. We analytically show that the speedup and the scaleup of our algorithm approach the optimal as the number of data points increases. We implemented our algorithm on an IBM POWERparallel SP2 with a maximum of 16 nodes. On typical test data sets, we observe nearly linear relative speedups, for example, 15.62 on 16 nodes, and essentially linear scaleup in the size of the data set and in the number of clusters desired. For a 2 gigabyte test data set, our implementation drives the 16 node SP2 at more than 1.8 gigaflops.
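
The data-parallel step the abstract describes has a standard shape: each node assigns its local points to the nearest centroid, accumulates per-cluster sums and counts, and a global reduction combines them. A minimal sketch of one iteration, assuming an MPI environment via mpi4py (the paper used MPI on an SP2; the code below is ours, not the paper's):

```python
# Sketch of one data-parallel k-means iteration (mpi4py assumed).
# Each rank holds a local block of points; one allreduce per quantity
# combines the per-cluster sums and counts across all nodes.
import numpy as np
from mpi4py import MPI

def kmeans_step(local_points, centroids, comm):
    k, d = centroids.shape
    # Assign each local point to its nearest centroid.
    dists = ((local_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Local partial sums and counts per cluster.
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = local_points[mask].sum(axis=0)
        counts[j] = mask.sum()
    # Global reduction: only k*d sums and k counts cross the network.
    global_sums = comm.allreduce(sums, op=MPI.SUM)
    global_counts = comm.allreduce(counts, op=MPI.SUM)
    return global_sums / np.maximum(global_counts, 1)[:, None]
```

Because only the k-by-d sums and the k counts are communicated per iteration, communication cost is independent of the number of points, which is consistent with the near-optimal speedup the authors derive.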

450 citations


Journal ArticleDOI
TL;DR: It is demonstrated that a combined input/output-queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet-scheduling algorithms, including WFQ and strict priorities.
Abstract: The Internet is facing two problems simultaneously: there is a need for a faster switching/routing infrastructure and a need to introduce guaranteed qualities-of-service (QoS). Each problem can be solved independently: switches and routers can be made faster by using input-queued crossbars instead of shared memory systems; QoS can be provided using weighted-fair queueing (WFQ)-based packet scheduling. Until now, however, the two solutions have been mutually exclusive-all of the work on WFQ-based scheduling algorithms has required that switches/routers use output-queueing or centralized shared memory. This paper demonstrates that a combined input/output-queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet-scheduling algorithms, including WFQ and strict priorities. More precisely, we show that for an N/spl times/N switch, a "speedup" of 2-1/N is necessary, and a speedup of two is sufficient for this exact emulation. Perhaps most interestingly, this result holds for all traffic arrival patterns. On its own, the result is primarily a theoretical observation; it shows that it is possible to emulate purely OQ switches with CIOQ switches running at approximately twice the line rate. To make the result more practical, we introduce several scheduling algorithms that with a speedup of two can emulate an OQ switch. We focus our attention on the simplest of these algorithms, critical cells first (CCF), and consider its running time and implementation complexity. We conclude that additional techniques are required to make the scheduling algorithms implementable at a high speed and propose two specific strategies.
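
Stated compactly (the worked number below is ours, not the paper's): for an N x N switch, the necessary and sufficient internal speedups are

```latex
S_{\text{necessary}}(N) = 2 - \frac{1}{N}, \qquad S_{\text{sufficient}} = 2
```

For N = 16 this gives 2 - 1/16 = 1.9375; the bound approaches 2 from below as N grows, which is why running the fabric at twice the line rate suffices for every port count and, as the paper shows, every arrival pattern.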

429 citations


Journal ArticleDOI
TL;DR: The dR*-tree is introduced, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer, in a ‘shared-nothing’ architecture with multiple computers interconnected through a network.
Abstract: The clustering algorithm DBSCAN relies on a density-based notion of clusters and is designed to discover clusters of arbitrary shape as well as to distinguish noise. In this paper, we present PDBSCAN, a parallel version of this algorithm. We use the ‘shared-nothing’ architecture with multiple computers interconnected through a network. A fundamental component of a shared-nothing system is its distributed data structure. We introduce the dR*-tree, a distributed spatial index structure in which the data is spread among multiple computers and the indexes of the data are replicated on every computer. We implemented our method using a number of workstations connected via Ethernet (10 Mbit). A performance evaluation shows that PDBSCAN offers nearly linear speedup and has excellent scaleup and sizeup behavior.

296 citations


Proceedings ArticleDOI
21 Mar 1999
TL;DR: In this article, a combined input output queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet scheduling algorithms, including WFQ and strict priorities.
Abstract: The Internet is facing two problems simultaneously: there is a need for a faster switching/routing infrastructure, and a need to introduce guaranteed qualities of service (QoS). Each problem can be solved independently: switches and routers can be made faster by using input-queued crossbars, instead of shared memory systems; and QoS can be provided using WFQ-based packet scheduling. However, until now, the two solutions have been mutually exclusive-all of the work on WFQ-based scheduling algorithms has required that switches/routers use output-queueing, or centralized shared memory. This paper demonstrates that a combined input output queueing (CIOQ) switch running twice as fast as an input-queued switch can provide precise emulation of a broad class of packet scheduling algorithms, including WFQ and strict priorities. More precisely, we show that a "speedup" of 2 is sufficient, and a speedup of 2-1/N is necessary, for this exact emulation. We introduce a variety of algorithms that configure the crossbar so that emulation is achieved with a speedup of two, and consider their running time and implementation complexity. An interesting feature of our work is that the exact emulation holds for all input traffic patterns. We believe that, in the future, these results will make possible the support of QoS in very high bandwidth routers.

289 citations


Journal ArticleDOI
TL;DR: In this paper, the authors demonstrate how efficient low-order dynamical models for micromechanical devices can be constructed using data from a few runs of fully meshed but slow numerical models such as those created by the finite element method (FEM).
Abstract: In this paper, we demonstrate how efficient low-order dynamical models for micromechanical devices can be constructed using data from a few runs of fully meshed but slow numerical models such as those created by the finite-element method (FEM). These reduced-order macromodels are generated by extracting global basis functions from the fully meshed model runs in order to parameterize solutions with far fewer degrees of freedom. The macromodels may be used for subsequent simulations of the time-dependent behavior of nonlinear devices in order to rapidly explore the design space of the device. As an example, the method is used to capture the behavior of a pressure sensor based on the pull-in time of an electrostatically actuated microbeam, including the effects of squeeze-film damping due to ambient air under the beam. Results show that the reduced-order model decreases simulation time by at least a factor of 37 with less than 2% error. More complicated simulation problems show significantly higher speedup factors. The simulations also show good agreement with experimental data.
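
Extracting global basis functions from a handful of full-model runs is, in essence, a snapshot/proper-orthogonal-decomposition step; a minimal numpy sketch under that assumption (the paper's exact extraction procedure may differ):

```python
# Sketch: build a reduced-order basis from full-model snapshots via SVD
# (a standard proper-orthogonal-decomposition step; assumed here).
import numpy as np

def build_macromodel_basis(snapshots, r):
    """snapshots: (n_dof, n_snapshots) matrix of fully meshed solutions.
    Returns the r dominant global basis vectors, r << n_dof."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    return U[:, :r]

def reduce_state(basis, full_state):
    # Project a full state onto the r low-order coordinates.
    return basis.T @ full_state

def expand_state(basis, reduced_state):
    # Reconstruct an approximate full state from r coordinates.
    return basis @ reduced_state
```

Subsequent time integration then evolves only the r reduced coordinates, which is where the reported factor-of-37-or-better simulation speedup comes from.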

287 citations


Proceedings ArticleDOI
21 Sep 1999
TL;DR: cJVM as discussed by the authors is a Java Virtual Machine (JVM) that provides a single system image of a traditional JVM while executing on a cluster, supporting any pure Java application without requiring any code modifications.
Abstract: cJVM is a Java Virtual Machine (JVM) that provides a single system image of a traditional JVM while executing on a cluster. cJVM virtualizes the cluster, supporting any pure Java application without requiring any code modifications. By distributing the application's work among the cluster's nodes, cJVM aims to obtain improved scalability for Java Server Applications. cJVM uses a novel object model which distinguishes between an application's view of an object and its implementation (e.g., different objects of the same class may have different implementations). This allows us to exploit knowledge of the usage of individual objects to improve performance. cJVM is work-in-progress. Our prototype runs on a cluster of IBM IntelliStations running Win/NT, connected via a Myrinet switch. It provides a single system image to applications, distributing the application's threads and objects over the cluster. We have used cJVM to run, without change, a real Java application containing over 10Kloc, and have achieved linear speedup for another application with a large number of independent threads. This paper discusses cJVM's architecture and implementation, showing how to provide a single system image of a traditional JVM on a cluster.

217 citations


Proceedings ArticleDOI
24 Oct 1999
TL;DR: A new data structure, called time-space partitioning (TSP) tree, is proposed that can effectively capture both the spatial and the temporal coherence from a time-varying field and can achieve substantial speedup while the storage space overhead for the TSP tree is kept at a minimum.
Abstract: This paper presents a fast volume rendering algorithm for time-varying fields. We propose a new data structure, called Time-Space Partitioning (TSP) tree, that can effectively capture both the spatial and the temporal coherence from a time-varying field. Using the proposed data structure, the rendering speed is substantially improved. In addition, our data structure helps to maintain the memory access locality and to provide the sparse data traversal so that our algorithm becomes suitable for large-scale out-of-core applications. Finally, our algorithm allows flexible error control for both the temporal and the spatial coherence so that a trade-off between image quality and rendering speed is possible. We demonstrate the utility and speed of our algorithm with data from several time-varying CFD simulations. Our rendering algorithm can achieve substantial speedup while the storage space overhead for the TSP tree is kept at a minimum.

197 citations


Journal ArticleDOI
TL;DR: A scalable parallel implementation of the self organizing map (SOM) suitable for data-mining applications involving clustering or segmentation against large data sets such as those encountered in the analysis of customer spending patterns is described.
Abstract: We describe a scalable parallel implementation of the self organizing map (SOM) suitable for data-mining applications involving clustering or segmentation against large data sets such as those encountered in the analysis of customer spending patterns. The parallel algorithm is based on the batch SOM formulation in which the neural weights are updated at the end of each pass over the training data. The underlying serial algorithm is enhanced to take advantage of the sparseness often encountered in these data sets. Analysis of a realistic test problem shows that the batch SOM algorithm captures key features observed using the conventional on-line algorithm, with comparable convergence rates. Performance measurements on an SP2 parallel computer are given for two retail data sets and a publicly available set of census data. These results demonstrate essentially linear speedup for the parallel batch SOM algorithm, using both a memory-contained sparse formulation as well as a separate implementation in which the mining data is accessed directly from a parallel file system. We also present visualizations of the census data to illustrate the value of the clustering information obtained via the parallel SOM method.
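
In the batch formulation, each weight vector is recomputed at the end of a pass as a neighborhood-weighted mean of the training data, rather than nudged after every record. A minimal numpy sketch of one pass (the Gaussian neighborhood is an assumption; the paper's sparse formulation is not shown):

```python
# Sketch of one batch-SOM pass: accumulate neighborhood-weighted sums
# over the whole training set, then update all weights at once.
import numpy as np

def batch_som_pass(data, weights, grid, sigma):
    # data: (n, d); weights: (m, d); grid: (m, 2) map coordinates.
    # Best-matching unit for every record.
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)                       # (n,)
    # Gaussian neighborhood of each winner over the map (assumption).
    g2 = ((grid[bmu][:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-g2 / (2 * sigma ** 2))            # (n, m)
    # Batch update: one weighted-mean update per pass, not per record.
    num = h.T @ data                              # (m, d)
    den = h.sum(axis=0)[:, None]                  # (m, 1)
    return num / np.maximum(den, 1e-12)
```

The per-pass sums are associative, which is what makes the parallel version straightforward: each node accumulates its local num and den and a reduction combines them, exactly as in the parallel k-means entry above.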

168 citations


01 Jan 1999
TL;DR: Three parallel algorithms that represent a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information are presented.
Abstract: We consider the problem of mining association rules on a shared-nothing multiprocessor. We present three parallel algorithms that represent a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information. We describe the implementation of these algorithms on an IBM POWERparallel SP2, a shared-nothing machine. Performance measurements from this implementation show that the best algorithm, Count Distribution, scales linearly and has excellent speedup and sizeup behavior. The results from this study, besides being of interest in themselves, provide guidance for the design of parallel algorithms for other data mining tasks.
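
Count Distribution replicates the candidate itemsets on every node, has each node count supports over its local transactions, and exchanges only the count vector. A minimal sketch of one pass, again assuming mpi4py (function and variable names are ours):

```python
# Sketch of one Count Distribution pass (mpi4py assumed): candidates
# are replicated on every node, each node counts them over its local
# transactions, and a single allreduce yields the global supports.
import numpy as np
from mpi4py import MPI

def count_distribution_pass(local_transactions, candidates, comm, min_support):
    counts = np.zeros(len(candidates), dtype=np.int64)
    for t in local_transactions:
        tset = set(t)
        for i, c in enumerate(candidates):
            if set(c) <= tset:
                counts[i] += 1
    # Only the count vector crosses the network, never the transactions.
    global_counts = comm.allreduce(counts, op=MPI.SUM)
    return [c for c, n in zip(candidates, global_counts)
            if n >= min_support]
```

Exchanging counts instead of data is what gives the algorithm its near-linear scaling on a shared-nothing machine: per-pass communication depends only on the number of candidates.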

163 citations


Journal ArticleDOI
TL;DR: It is proved that if the switch uses virtual output queueing and has an internal speedup of just four, it is possible for it to behave identically to an output-queued switch, regardless of the nature of the arriving traffic; this result is extended to show that, with a small modification, the MUCFA algorithm enables perfect emulation of a variety of output scheduling policies, including strict priorities and weighted fair-queueing.

Journal ArticleDOI
TL;DR: This paper describes the architecture for a work-conserving server using a combined I/O-buffered crossbar switch that employs a novel algorithm based on output occupancy, the lowest occupancy output first algorithm (LOOFA), and a speedup of only two.
Abstract: This paper describes the architecture for a work-conserving server using a combined I/O-buffered crossbar switch. The switch employs a novel algorithm based on output occupancy, the lowest occupancy output first algorithm (LOOFA), and a speedup of only two. A work-conserving switch provides the same throughput performance as an output-buffered switch. The work-conserving property of the switch is independent of the switch size and input traffic pattern. We also present a suite of algorithms that can be used in combination with LOOFA. These algorithms determine the fairness and delay properties of the switch. We also describe a mechanism to provide delay bounds for real-time traffic using LOOFA. These delay bounds are achievable without requiring output-buffered switch emulation.

Journal ArticleDOI
TL;DR: A new object representation scheme called the discrete function representation (DFR) is designed to reduce the computational cost of both contact detection and the more difficult problem of contact resolution.
Abstract: This paper addresses the problem of contact detection in discrete element multibody dynamic simulations. We present an overview of the problem and a detailed description of a new object representation scheme called the discrete function representation (DFR). This representation is designed to reduce the computational cost of both contact detection and the more difficult problem of contact resolution. The scheme has a maximum theoretical complexity of order $O(N)$ for contact resolution between bodies defined by $N$ boundary points. In practice, the discrete element method constrains overlap between objects and the actual complexity is approximately $O(\sqrt{N})$, giving a speedup of nearly 2 orders of magnitude over traditional algorithms for systems with more than 1000 objects. The technique is robust and is able to handle convex and concave object geometries, including objects containing holes. Examples of relatively large discrete element simulations in three dimensions are presented.

Proceedings ArticleDOI
01 May 1999
TL;DR: Maps, a compiler managed memory system for Raw architectures, is implemented based on the SUIF infrastructure and it is demonstrated that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for regular applications and about 5-fold speedup on 16 or more tiles for irregular applications.
Abstract: This paper describes Maps, a compiler managed memory system for Raw architectures. Traditional processors for sequential programs maintain the abstraction of a unified memory by using a single centralized memory system. This implementation leads to the infamous "Von Neumann bottleneck," with machine performance limited by the large memory latency and limited memory bandwidth. A Raw architecture addresses this problem by taking advantage of the rapidly increasing transistor budget to move much of its memory on chip. To remove the bottleneck and complexity associated with centralized memory, Raw distributes the memory with its processing elements. Unified memory semantics are implemented jointly by the hardware and the compiler. The hardware provides a clean compiler interface to its two inter-tile interconnects: a fast, statically schedulable network and a traditional dynamic network. Maps then uses these communication mechanisms to orchestrate the memory accesses for low latency and parallelism while enforcing proper dependence. It optimizes for speed in two ways: by finding accesses that can be scheduled on the static interconnect through static promotion, and by minimizing dependence sequentialization for the remaining accesses. Static promotion is performed using equivalence class unification and modulo unrolling; memory dependences are enforced through explicit synchronization and software serial ordering. We have implemented Maps based on the SUIF infrastructure. This paper demonstrates that the exclusive use of static promotion yields roughly 20-fold speedup on 32 tiles for our regular applications and about 5-fold speedup on 16 or more tiles for our irregular applications. The paper also shows that selective use of dynamic accesses can be a useful complement to the mostly static memory system.

Proceedings ArticleDOI
23 Mar 1999
TL;DR: Presents parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems and shows that the construction of a decision-Tree classifier can be effectively parallelized on an SMP machine with good speedup.
Abstract: Presents parallel algorithms for building decision-tree classifiers on shared-memory multiprocessor (SMP) systems. The proposed algorithms span the gamut of data and task parallelism. The data parallelism is based on attribute scheduling among processors. This basic scheme is extended with task pipelining and dynamic load balancing to yield faster implementations. The task-parallel approach uses dynamic subtree partitioning among processors. Our performance evaluation shows that the construction of a decision-tree classifier can be effectively parallelized on an SMP machine with good speedup.

Patent
12 Jan 1999
TL;DR: In this article, an arbitration scheme for providing deterministic bandwidth and delay guarantees in an input-buffered crossbar switch with speedup S is presented, where the arbitration scheme determines the sequence of fixed-size packet transmission between the input channels and output channels satisfying the constraint that only one cell can leave an input channel and enter an output channel per phase in such a way that the arbitration delay is bounded for each cell awaiting transmission at the input channel.
Abstract: An arbitration scheme for providing deterministic bandwidth and delay guarantees in an input-buffered crossbar switch with speedup S is presented. Within the framework of a crossbar architecture having a plurality of input channels and output channels, the arbitration scheme determines the sequence of fixed-size packet (or cell) transmissions between the input channels and output channels, satisfying the constraint that only one cell can leave an input channel and enter an output channel per phase, in such a way that the arbitration delay is bounded for each cell awaiting transmission at the input channel. If the fixed-size packets result from fragmentation of variable-size packets, the scheduling and arbitration scheme provides deterministic delay guarantees for the original variable-size packets (re-assembled at the output channel) as well.

Proceedings ArticleDOI
01 Feb 1999
TL;DR: The architecture of a custom computing machine that overcomes the interconnection bottleneck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98 is described.
Abstract: As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that overcomes the interconnection bottleneck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98. The OneChip-98 system has a seamless programming model that enables the programmer to easily specify instructions without additional complex instruction decoding hardware. As well, there is a simple scheme for mapping instructions to the corresponding programming bits. To allow the processor and the reconfigurable array to execute concurrently, the programming model utilizes a novel memory-consistency scheme implemented in the hardware. To evaluate the feasibility of the OneChip-98 architecture, a 32-bit MIPS-like processor and several performance enhancement applications were mapped to the Transmogrifier-2 field programmable system. For two typical applications, the 2-dimensional discrete cosine transform and the 64-tap FIR filter, we were capable of achieving a performance speedup of over 30 times that of a stand-alone state-of-the-art processor.

05 Aug 1999
TL;DR: A new version of DASPK, DASPK3.0, with capability for sensitivity analysis is presented in this report; one of its features is an improved algorithm for calculation of consistent initial conditions for index-zero or index-one systems.
Abstract: A new version of DASPK, DASPK3.0, with capability for sensitivity analysis is presented in this report. DASPK3.0 differs from the sensitivity code DASPKSO, described in [MaPe96], in several ways. DASPK3.0 has all the features of the previous version DASPK2.0, which were not available in DASPKSO. One of these features is an improved algorithm for calculation of consistent initial conditions for index-zero or index-one systems. DASPK3.0 also incorporates a mechanism for initialization and solution of index-2 systems. Other improvements in DASPK3.0 include a more accurate error and convergence test, particularly for the sensitivity analysis. We implemented the Krylov method for sensitivity computation with a different strategy from DASPKSO, and made it more efficient and easier for parallel computing. We also added the staggered corrector method [FeBa97] for both the direct and Krylov methods. We implemented the sensitivity analysis with an internal parallel mode, which is easy to use for both serial and parallel computation with the message passing interface (MPI). We also incorporated automatic differentiation into DASPK3.0 to evaluate the Jacobian matrix and sensitivity equations. The goal of our design has been to be compatible as much as possible with DASPK2.0, to minimize memory and storage requirements for sensitivity analysis, and to speed up the computation for a large number of sensitivity parameters.

Journal ArticleDOI
TL;DR: An optimal order of convergence is shown experimentally for a wide range of two-phase flow problems, including heterogeneous media and vanishing capillary pressure, and a data-parallel implementation of the Newton-multigrid algorithm with speedup results is presented.

Journal ArticleDOI
TL;DR: This paper compresses the instruction segment of the executable running on the embedded system and shows how to design a run-time decompression unit to decompress code on the fly before execution.
Abstract: In this paper, we present a method for reducing the memory requirements of an embedded system by using code compression. We compress the instruction segment of the executable running on the embedded system, and we show how to design a run-time decompression unit to decompress code on the fly before execution. Our algorithm uses arithmetic coding in combination with a Markov model, which is adapted to the instruction set and the application. We provide experimental results on two architectures, Analog Devices' SHARC and ARM's ARM and Thumb instruction sets, and show that programs can often be reduced by more than 50%. Furthermore, we suggest a table-based design that allows multibit decoding to speed up decompression.

Proceedings ArticleDOI
01 May 1999
TL;DR: This paper presents the design and implementation of a compiler that is designed to parallelize divide and conquer algorithms whose subproblems access disjoint regions of dynamically allocated arrays and shows that the programs perform well and exhibit good speedup.
Abstract: Divide and conquer algorithms are a good match for modern parallel machines: they tend to have large amounts of inherent parallelism and they work well with caches and deep memory hierarchies. But these algorithms pose challenging problems for parallelizing compilers. They are usually coded as recursive procedures and often use pointers into dynamically allocated memory blocks and pointer arithmetic. All of these features are incompatible with the analysis algorithms in traditional parallelizing compilers. This paper presents the design and implementation of a compiler that is designed to parallelize divide and conquer algorithms whose subproblems access disjoint regions of dynamically allocated arrays. The foundation of the compiler is a flow-sensitive, context-sensitive, and interprocedural pointer analysis algorithm. A range of symbolic analysis algorithms build on the pointer analysis information to extract symbolic bounds for the memory regions accessed by (potentially recursive) procedures that use pointers and pointer arithmetic. The symbolic bounds information allows the compiler to find procedure calls that can execute in parallel without violating the data dependences. The compiler generates code that executes these calls in parallel. We have used the compiler to parallelize several programs that use divide and conquer algorithms. Our results show that the programs perform well and exhibit good speedup.
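
The code the compiler generates amounts to running recursive calls concurrently once the symbolic bounds prove their array regions disjoint. A hand-written Python analogue of that output (the threshold, pool API, and toy kernel are our choices, not the compiler's):

```python
# Hand-written analogue of the compiler's output: recursive calls on
# disjoint halves of an array run in parallel, which is safe because
# the regions [lo, mid) and [mid, hi) cannot alias.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(a, lo, hi, pool, cutoff=1 << 17):
    if hi - lo <= cutoff:
        return sum(a[lo:hi])          # small subproblem: run serially
    mid = (lo + hi) // 2
    # Left half runs in the pool; right half runs in this thread.
    left = pool.submit(parallel_sum, a, lo, mid, pool, cutoff)
    right = parallel_sum(a, mid, hi, pool, cutoff)
    return left.result() + right

if __name__ == "__main__":
    data = list(range(1_000_000))
    with ThreadPoolExecutor() as pool:
        print(parallel_sum(data, 0, len(data), pool))
```

Computing one half inline and submitting only the other keeps the number of blocked pool tasks proportional to the recursion depth, a standard precaution when nesting tasks in a bounded pool.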

Proceedings ArticleDOI
01 Feb 1999
TL;DR: A novel approach for runtime mapping is proposed that utilizes self-reconfigurability of multicontext FPGAs to achieve very high speedups over existing approaches and to demonstrate the feasibility of this approach, a detailed implementation of the KMP string matching algorithm is presented.
Abstract: FPGAs can perform better than ASICs if the logic mapped onto them is optimized for each problem instance. Unfortunately, this advantage is often canceled by the long time needed by CAD tools to generate problem instance dependent logic and the time required to configure the FPGAs. In this paper, a novel approach for runtime mapping is proposed that utilizes self-reconfigurability of multicontext FPGAs to achieve very high speedups over existing approaches. The key idea is to design and map logic onto a multicontext FPGA that in turn maps problem instance dependent logic onto other contexts of the same FPGA. As a result, CAD tools need to be used just once for each problem and not once for every problem instance as is usually done. To demonstrate the feasibility of our approach, a detailed implementation of the KMP string matching algorithm is presented which involves runtime construction of a finite state machine. We implement the KMP algorithm on a conventional FPGA (Xilinx XC 6216) and use it to obtain accurate estimates of performance on a multicontext device. Speedups in mapping time of ≈ 10⁶ over CAD tools and more than 1800 over a program written specifically for FSM generation were obtained. Significant speedups were obtained in overall execution time as well, including a speedup ranging from 3 to 16 times over a software implementation of the KMP algorithm running on a Sun Ultra 1 Model 140 workstation.
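
The finite state machine constructed at runtime is the KMP failure (prefix-function) automaton; for reference, here is the textbook software construction that the FPGA mapper replaces (this is the standard algorithm, not the paper's hardware design):

```python
# Textbook construction of the KMP failure function: the FSM that the
# self-reconfiguring FPGA builds in hardware for each new pattern.
def kmp_failure(pattern):
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]           # fall back to the longest border
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    fail, k, hits = kmp_failure(pattern), 0, []
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)    # match ends at position i
            k = fail[k - 1]
    return hits

print(kmp_search("ababcabab", "ababc"))  # -> [0]
```

The construction is instance dependent (one automaton per pattern), which is exactly why the paper moves it onto the device itself rather than rerunning CAD tools for every pattern.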

Journal ArticleDOI
TL;DR: This paper describes the experience with a simple modeling and programming approach for increasing the amount of constraint propagation in the constraint solving process, and introduces the notions of CSP model and model redundancy, and shows how mutually redundant models can be combined and connected using channeling constraints.
Abstract: This paper describes our experience with a simple modeling and programming approach for increasing the amount of constraint propagation in the constraint solving process. The idea, although similar to redundant constraints, is based on the concept of redundant modeling. We introduce the notions of CSP model and model redundancy, and show how mutually redundant models can be combined and connected using channeling constraints. The combined model contains the mutually redundant models as sub-models. Channeling constraints allow the sub-models to cooperate during constraint solving by propagating constraints freely amongst the sub-models. This extra level of pruning and propagation activity becomes the source of execution speedup. We perform two case studies to evaluate the effectiveness and efficiency of our method. The first case study is based on the simple and well-known n-queens problem, while the second applies our method in the design and construction of a real-life nurse rostering system. Experimental results provide empirical evidence in line with our prediction.
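
For n-queens, the two mutually redundant models are typically "the queen in row i sits in column x[i]" and "the queen in column j sits in row y[j]", linked by the channeling constraints x[i] = j if and only if y[j] = i. A minimal checker-style sketch of that linkage (the model pair is assumed from the n-queens case study; a real solver would post these as propagating constraints rather than check them after the fact):

```python
# Sketch of redundant modeling on n-queens: a row-based model x and a
# column-based model y, linked by channeling constraints x[i]=j <=> y[j]=i.
def channeled(x, y):
    n = len(x)
    return (all(y[x[i]] == i for i in range(n)) and
            all(x[y[j]] == j for j in range(n)))

def consistent(x):
    # No two queens share a column or a diagonal (rows are implicit).
    n = len(x)
    return all(x[i] != x[j] and abs(x[i] - x[j]) != abs(i - j)
               for i in range(n) for j in range(i + 1, n))

# x: column of the queen in each row; y: row of the queen in each column.
x = [1, 3, 0, 2]
y = [2, 0, 3, 1]
assert channeled(x, y) and consistent(x)
```

In a solver, pruning a value from x[i] immediately prunes the mirrored value from y, so each sub-model's propagators feed the other; that cross-propagation is the speedup source the abstract describes.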

Dissertation
01 Jan 1999
TL;DR: This paper focuses on the main programming issues associated with the design of the latest version of AKAROA2, a simulation package in which network computing is applied in a practical and user-friendly way, based on the MRIP scenario of distributed simulation.
Abstract: In this paper we discuss an application of network computing in the area of stochastic simulation. We focus on the main programming issues associated with the design of the latest version of AKAROA2, a simulation package in which network computing is applied in a practical and user-friendly way. This implementation is based on the Multiple Replications In Parallel (MRIP) scenario of distributed simulation, in which multiple computers of a network operate as concurrent simulation engines, generating statistically equivalent simulation output data and submitting them to global data analysers responsible for analysis of the final results and for stopping the simulation. The MRIP scenario can achieve speedup equal to the number of processors used.

Proceedings ArticleDOI
01 May 1999
TL;DR: The Critical Channel Traversing algorithm is introduced, a new scheduling algorithm for both sequential and parallel discrete event simulation for low-granularity network models on shared-memory multiprocessor computers and performance is enhanced by supporting good cache behavior and automatic load balancing.
Abstract: This paper introduces the Critical Channel Traversing (CCT) algorithm, a new scheduling algorithm for both sequential and parallel discrete event simulation. CCT is a general conservative algorithm that is aimed at the simulation of low-granularity network models on shared-memory multiprocessor computers. An implementation of the CCT algorithm within a kernel called TasKit has demonstrated excellent performance for large ATM network simulations when compared to previous sequential, optimistic and conservative kernels. TasKit has achieved two to three times speedup on a single processor with respect to a splay tree central-event-list based sequential kernel. On a 16 processor (R8000) Silicon Graphics PowerChallenge, TasKit has achieved an event-rate of 1.2 million events per second and a speedup of 26 relative to the sequential kernel for a large ATM network model. Performance is achieved through a multi-level scheduling scheme that supports the scheduling of large grains of computation even with low-granularity events. Performance is also enhanced by supporting good cache behavior and automatic load balancing. The paper describes the algorithm and its motivation, proves its correctness and briefly presents performance results for TasKit.

Book ChapterDOI
31 Aug 1999
TL;DR: The realization of a parallel version of the k/h-means clustering algorithm is described and it is shown how a database can be distributed and how the algorithm can be applied to this distributed database.
Abstract: This paper describes the realization of a parallel version of the k/h-means clustering algorithm. This is one of the basic algorithms used in a wide range of data mining tasks. We show how a database can be distributed and how the algorithm can be applied to this distributed database. The tests, conducted on a network of 32 PCs, showed a nearly ideal speedup for large data sets.

Proceedings Article
29 Nov 1999
TL;DR: An adaptive recomputation strategy is introduced that is shown to speed up search while keeping memory consumption low, and copying with recomputation is shown to outperform trailing on large problems with respect to both space and time.
Abstract: A central service of a constraint programming system is search. In almost all constraint programming systems search is based on trailing, which is well understood and known to be efficient. This paper compares trailing to copying. Copying offers more expressiveness as required by parallel and concurrent systems. However, little is known about how trailing compares to copying in terms of implementation effort, runtime efficiency, and memory requirements. This paper discusses these issues. Execution speed of a copying-based system is shown to be competitive with state-of-the-art trailing-based systems. For the first time, a detailed analysis and comparison with respect to memory usage is made. It is shown how recomputation decreases memory requirements, which can be prohibitive for large problems with copying alone. The paper introduces an adaptive recomputation strategy that is shown to speed up search while keeping memory consumption low. It is demonstrated that copying with recomputation outperforms trailing on large problems with respect to both space and time.
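
The space/time trade-off is between storing a copy of the solver state at every node and recomputing states by replaying decisions from the nearest stored copy. A schematic sketch of plain recomputation with a fixed copy distance (the node and state structures are toy stand-ins, not the actual system's API):

```python
# Schematic sketch of recomputation in tree search: store a full copy of
# the solver state only every few levels; states in between are rebuilt
# by replaying decisions from the nearest ancestor that kept a copy.
class Node:
    def __init__(self, parent=None, decision=None, state_copy=None):
        self.parent = parent          # parent node in the search tree
        self.decision = decision      # decision taken to reach this node
        self.copy = state_copy        # stored state, or None

def state_at(node, apply_decision):
    """Recompute the state of `node` from the nearest stored copy."""
    path = []
    while node.copy is None:
        path.append(node.decision)
        node = node.parent
    state = list(node.copy)           # clone the stored copy (toy: a list)
    for decision in reversed(path):   # replay the recorded decisions
        apply_decision(state, decision)
    return state

# Toy usage: the state is simply the list of decisions made so far.
root = Node(state_copy=[])
a = Node(parent=root, decision="x=1")
b = Node(parent=a, decision="y=2")
print(state_at(b, lambda s, d: s.append(d)))   # -> ['x=1', 'y=2']
```

The adaptive strategy in the paper goes further, placing extra copies where the search is failing often so that replay chains stay short exactly where they would otherwise be traversed repeatedly.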

27 Aug 1999
TL;DR: It is demonstrated that scaling the architecture leads to near-linear application speedup, and the effect of scaling the capacity and parallelism of the on-chip memory system on die area and sustained performance is evaluated.
Abstract: Next generation portable devices will require processors with both low energy consumption and high performance for media functions. At the same time, modern CMOS technology creates the need for highly scalable VLSI architectures. Conventional processor architectures fail to meet these requirements. This paper presents the architecture of Vector IRAM (VIRAM), a processor that combines vector processing with embedded DRAM technology. Vector processing achieves high multimedia performance with simple hardware, while embedded DRAM provides high memory bandwidth at low energy consumption. VIRAM provides flexible support for media data types, short vectors, and DSP features. The vector pipeline is enhanced to hide DRAM latency without using caches. The peak performance is 3.2 GFLOPS (single precision) and maximum memory bandwidth is 25.6 GBytes/s. With a target power consumption of 2 Watts for the vector pipeline and the memory system, VIRAM supports 1.6 GFLOPS/Watt. For a set of representative media kernels, VIRAM sustains on average 88% of its peak performance, outperforming conventional SIMD media extensions and DSP processors by factors of 4.5 to 17. Using a clustered implementation approach, the modular design can be scaled without complicating control logic. We demonstrate that scaling the architecture leads to near linear application speedup. We also evaluate the effect of scaling the capacity and parallelism of the on-chip memory system on die area and sustained performance.

Proceedings ArticleDOI
28 Jul 1999
TL;DR: STB (Shape To Bit-vector) indexing is introduced, and experimentally validate it on space telemetry, medical and synthetic data, demonstrating approximately an order-of-magnitude speedup.
Abstract: Addresses the problem of similarity searching in large time-series databases. We introduce a novel indexing algorithm that allows faster retrieval. The index is formed by creating bins that contain time series subsequences of approximately the same shape. For each bin, we can quickly calculate a lower bound on the distance between a given query and the most similar element of the bin. This bound allows us to search the bins in best-first order, and to prune some bins from the search space without having to examine the contents. Additional speedup is obtained by optimizing the data within the bins such that we can avoid having to compare the query to every item in the bin. We call our approach STB (Shape To Bit-vector) indexing, and experimentally validate it on space telemetry, medical and synthetic data, demonstrating approximately an order-of-magnitude speedup.
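
The best-first traversal the abstract describes is generic: visit bins in increasing order of a cheap lower bound and stop once no unvisited bin can beat the best match found so far. A sketch with heapq (the bin layout and the lower-bound function are placeholders for the paper's shape bins):

```python
# Generic sketch of STB-style best-first search: visit bins in order of
# an admissible lower bound and prune bins that cannot beat the current
# best match. (Bin contents and lower_bound are placeholders.)
import heapq

def best_first_search(query, bins, lower_bound, distance):
    # lower_bound(query, bin) must never exceed the true distance to
    # any element of the bin, or pruning becomes unsound.
    heap = [(lower_bound(query, b), i) for i, b in enumerate(bins)]
    heapq.heapify(heap)
    best, best_dist = None, float("inf")
    while heap:
        lb, i = heapq.heappop(heap)
        if lb >= best_dist:
            break                     # no remaining bin can do better
        for item in bins[i]:
            d = distance(query, item)
            if d < best_dist:
                best, best_dist = item, d
    return best, best_dist
```

Because bins whose lower bound exceeds the running best are never opened, most of the database is excluded without a single full distance computation, which is where the order-of-magnitude speedup comes from.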

Patent
09 Apr 1999
TL;DR: In this article, an apparatus for processing data has a Single-Instruction-Multiple-Data (SIMD) architecture, and a number of features that improve performance and programmability.
Abstract: An apparatus for processing data has a Single-Instruction-Multiple-Data (SIMD) architecture, and a number of features that improve performance and programmability. The apparatus includes a rectangular array of processing elements and a controller. In one aspect, each of the processing elements includes one or more addressable storage means and other elements arranged in a pipelined architecture. The controller includes means for receiving a high level instruction, and converting each instruction into a sequence of one or more processing element microinstructions for simultaneously controlling each stage of the processing element pipeline. In doing so, the controller detects and resolves a number of resource conflicts, and automatically generates instructions for registering image operands that are skewed with respect to one another in the processing element array. In another aspect, a programmer references images via pointers to image descriptors that include the actual addresses of various bits of multi-bit data. Other features facilitate and speed up the movement of data into and out of the apparatus. 'Hit' detection and histogram logic are also included.