
Showing papers on "Degree of parallelism" published in 1997


Proceedings ArticleDOI
01 Jan 1997
TL;DR: The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering.
Abstract: This paper presents the first algorithm to find the optimal affine transform that maximizes the degree of parallelism while minimizing the degree of synchronization in a program with arbitrary loop nestings and affine data accesses. The problem is formulated without the use of imprecise data dependence abstractions such as data dependence vectors. The algorithm presented subsumes previously proposed program transformation algorithms that are based on unimodular transformations, loop fusion, fission, scaling, reindexing and/or statement reordering.
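As a rough illustration of the kind of affine transformation such an algorithm searches over (a textbook skewing example, not the paper's method), consider a wavefront loop nest whose inner loop becomes fully parallel after the affine change of iteration variables t = i + j, leaving one synchronization per value of t:

```c
/* Illustrative sketch only (not the paper's algorithm).
   Original nest: a[i][j] depends on a[i-1][j] and a[i][j-1],
   so neither loop is parallel as written. */
#define N 1024
double a[N][N];

void original(void) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* After skewing with t = i + j, all iterations sharing the same t are
   independent: the inner loop can run in parallel, with one
   synchronization (barrier) per outer iteration. */
void skewed(void) {
    for (int t = 2; t <= 2 * (N - 1); t++) {
        int lo = t - (N - 1) > 1 ? t - (N - 1) : 1;
        int hi = t - 1 < N - 1 ? t - 1 : N - 1;
        for (int i = lo; i <= hi; i++) {   /* parallel, e.g. an omp for */
            int j = t - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}
```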

215 citations


01 Jan 1997
TL;DR: The paper presents the representations, the operators and the interpreters used in PDGP, and describes experiments in which PDGP has been compared to standard GP.
Abstract: Parallel Distributed Genetic Programming (PDGP) is a new form of Genetic Programming (GP) suitable for the development of programs with a high degree of parallelism. Programs are represented in PDGP as graphs with nodes representing functions and terminals, and links representing the flow of control and results. The paper presents the representations, the operators and the interpreters used in PDGP, and describes experiments in which PDGP has been compared to standard GP.

118 citations


Proceedings ArticleDOI
16 Apr 1997
TL;DR: The paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform; the resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.
Abstract: Automated target recognition is an application area that requires special-purpose hardware to achieve reasonable performance. FPGA-based platforms can provide a high level of performance for ATR systems if the implementation can be adapted to the limited FPGA and routing resources of these architectures. The paper discusses a mapping experiment where a linear-systolic implementation of an ATR algorithm is mapped to the SPLASH 2 platform. Simple column oriented processors were used throughout the design to achieve high performance with limited nearest neighbor communication. The distributed SPLASH 2 memories are also exploited to achieve a high degree of parallelism. The resulting design is scalable and can be spread across multiple SPLASH 2 boards with a linear increase in performance.

59 citations


Journal ArticleDOI
01 May 1997
TL;DR: This paper primarily focuses on the distributed data layout and scheduling techniques developed as a part of the Massively-parallel And Real-time Storage (MARS) project, which support a high degree of parallelism and concurrency, and efficiently implement various playout control operations.
Abstract: Large-scale on-demand multimedia servers that can provide independent and interactive access to a vast amount of multimedia information to a large number of concurrent clients will be required for a widespread deployment of exciting multimedia applications. Our project, called Massively-parallel And Real-time Storage (MARS), is aimed at prototype development of such a large-scale server. This paper primarily focuses on the distributed data layout and scheduling techniques developed as a part of this project. These techniques support a high degree of parallelism and concurrency, and efficiently implement various playout control operations, such as fast forward, rewind, pause, resume, frame advance and random access.

37 citations


Journal ArticleDOI
TL;DR: It is proved that only an amortized constant amount of rebalancing is necessary after an update in a chromatic search tree, and it is shown that the amount of rebalancing done at any particular level decreases exponentially, going from the leaves toward the root.

35 citations


Journal ArticleDOI
TL;DR: In this paper, it is shown that the inertia matrix associated with any open-or closed-loop mechanism is positive definite by finding a simple mathematical expression for the quadratic form expressing the kinetic energy in an associated state space.
Abstract: In this paper, advantage is taken of the problem structure in multibody dynamics simulation when the mechanical system is modeled using a minimal set of generalized coordinates. It is shown that the inertia matrix associated with any open- or closed-loop mechanism is positive definite by finding a simple mathematical expression for the quadratic form expressing the kinetic energy in an associated state space. Based on this result, an algorithm that efficiently solves for second time derivatives of the generalized coordinates is presented. Significant speed-ups accrue due to both the no fill-in factorization of the composite inertia matrix technique and the degree of parallelism attainable with the new algorithm.
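For context, the standard link between kinetic energy and the inertia matrix can be stated in one line (a generic argument, not the paper's specific state-space construction):

```latex
% Kinetic energy as a quadratic form in the generalized velocities \dot{q}.
% Positivity of the energy for every nonzero \dot{q} is exactly positive
% definiteness of M(q), which is what permits stable, no-fill-in
% factorizations when solving M(q)\,\ddot{q} = f(q,\dot{q}).
T \;=\; \tfrac{1}{2}\,\dot{q}^{\mathsf{T}} M(q)\,\dot{q} \;>\; 0
\quad \text{for all } \dot{q} \neq 0
\qquad\Longrightarrow\qquad M(q) \succ 0 .
```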

20 citations


Proceedings ArticleDOI
18 Dec 1997
TL;DR: This paper describes a heuristic algorithm for deciding the number of times and the directions in which loops should be unrolled, through the use of information such as dependence, reuse, and machine resources.
Abstract: Loop unrolling is one of the most promising parallelization techniques, because the nature of programs causes most of the processing time to be spent in their loops. Unrolling not only the innermost loop but also outer loops greatly expands the scope for reusing data and parallelizing instructions. Nested-loop unrolling is therefore a very effective way of obtaining a higher degree of parallelism. However, we need a method for measuring the efficiency of loop unrolling that takes account of both the reuse of data and the parallelism between instructions. This paper describes a heuristic algorithm for deciding the number of times and the directions in which loops should be unrolled, through the use of information such as dependence, reuse, and machine resources. Our method is evaluated by applying benchmark tests.
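As a minimal illustration of the effect being exploited (not the paper's heuristic), unrolling an outer loop and jamming the copies lets one loaded value be reused by several instruction streams, which is exactly the combination of data reuse and instruction-level parallelism the algorithm tries to balance:

```c
/* Illustrative unroll-and-jam sketch (hypothetical example, not from the
   paper). Unrolling the i-loop by 2 lets both copies reuse the same
   b[k][j], halving loads of b and exposing two independent multiply-add
   chains per inner iteration. */
void matmul_unrolled(int n, double a[n][n], double b[n][n], double c[n][n]) {
    for (int i = 0; i < n; i += 2)            /* assume n is even for brevity */
        for (int j = 0; j < n; j++) {
            double s0 = 0.0, s1 = 0.0;
            for (int k = 0; k < n; k++) {
                double bk = b[k][j];          /* loaded once, reused twice */
                s0 += a[i][k]     * bk;
                s1 += a[i + 1][k] * bk;
            }
            c[i][j]     = s0;
            c[i + 1][j] = s1;
        }
}
```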

19 citations


Journal ArticleDOI
23 Jun 1997
TL;DR: It is shown that, due to diminishing returns from further increases in ILP, multimedia applications will benefit more from additional exploitation of parallelism at the thread level, and that simultaneous multithreading (SMT), a novel architectural approach combining VLIW techniques with parallel processing of threads, can be used efficiently to further increase the performance of typical multimedia workloads.
Abstract: A number of recently published DSPs and multimedia processors emphasize Very Long Instruction Word (VLIW) architectures to achieve the flexibility, processing power and high-level language programmability needed for future multimedia applications. In this paper we show that exclusive exploitation of instruction-level parallelism decreases in efficiency as the degree of parallelism increases. This is mainly caused by algorithm characteristics, VLSI design and compiler restrictions. We discuss selected aspects from these fields and possible solutions to upcoming bottlenecks from a practical point of view.

16 citations


Journal ArticleDOI
01 Apr 1997
TL;DR: The goal is to extract as many parallel loops as the intrinsic degree of parallelism of the nest authorizes, while avoiding a full memory expansion.
Abstract: In this paper we briefly survey some loop transformation techniques which break anti- or output dependences, or artificial cycles involving such ‘false’ dependences. These false dependences are removed through the introduction of temporary buffer arrays. Next we show how to plug these techniques into loop parallelization algorithms (such as Allen and Kennedy's algorithm). The goal is to extract as many parallel loops as the intrinsic degree of parallelism of the nest authorizes, while avoiding a full memory expansion. We try to reduce the number of temporary arrays that we introduce, as well as their dimension.
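A minimal example of the kind of false-dependence removal surveyed (a hypothetical loop, not taken from the paper): copying the values that the loop reads into a temporary buffer breaks the anti-dependence, and both resulting loops are parallel.

```c
/* The anti-dependence serializes this loop: iteration i must read a[i+1]
   before iteration i+1 overwrites it. */
void serial(int n, double a[], const double c[]) {
    for (int i = 0; i < n - 1; i++)
        a[i] = a[i + 1] + c[i];
}

/* Introducing a temporary buffer removes the anti-dependence at the cost
   of one extra array; each of the two loops is now fully parallel. */
void expanded(int n, double a[], const double c[], double old[]) {
    for (int i = 0; i < n - 1; i++)   /* parallel */
        old[i] = a[i + 1];
    for (int i = 0; i < n - 1; i++)   /* parallel */
        a[i] = old[i] + c[i];
}
```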

14 citations


Book ChapterDOI
23 Sep 1997
TL;DR: A theory and proof rules for the refinement of action systems that communicate via remote procedures based on the data refinement approach are developed and the atomicity refinement of actions is studied.
Abstract: Recently the action systems formalism for parallel and distributed systems has been extended with the procedure mechanism. This gives us a very general framework for describing different communication paradigms for action systems, e.g. remote procedure calls. Action systems come with a design methodology based on the refinement calculus. Data refinement is a powerful technique for refining action systems. In this paper we will develop a theory and proof rules for the refinement of action systems that communicate via remote procedures based on the data refinement approach. The proof rules we develop are compositional so that modular refinement of action systems is supported. As an example we will especially study the atomicity refinement of actions. This is an important refinement strategy, as it potentially increases the degree of parallelism in an action system.

14 citations


Journal ArticleDOI
TL;DR: A degree of parallelism is an equivalence class of Scott-continuous functions which are relatively definable by each other with respect to the language PCF (a paradigmatic sequential language).

Journal ArticleDOI
Liang Chen
TL;DR: The key idea is that the customer departure times are represented by longest-path distance in directed graphs instead of by the usual recursive equations, which leads to scalable algorithms with a high degree of parallelism that can be implemented on either MIMD or SIMD parallel computers.
Abstract: This paper presents several basic algorithms for the parallel simulation of G/G/1 queueing systems and certain networks of such systems. The coverage includes systems subject to manufacturing or communication blocking, or to loss of customers due to capacity constraints. The key idea is that the customer departure times are represented by longest-path distances in directed graphs instead of by the usual recursive equations. This representation leads to scalable algorithms with a high degree of parallelism that can be implemented on either MIMD or SIMD parallel computers.
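For a single FIFO G/G/1 queue the two views coincide, which the following sketch checks on random data (our illustration, not the paper's parallel algorithm):

```c
#include <stdio.h>
#include <stdlib.h>

static double dmax(double x, double y) { return x > y ? x : y; }

/* Departure times of a FIFO G/G/1 queue computed two ways:
   (a) the usual recursion      D[i] = max(A[i], D[i-1]) + S[i]
   (b) longest-path distances   D[i] = max_{k<=i} (A[k] + S[k] + ... + S[i]),
   i.e. distances in a DAG with edges source->k of weight A[k]+S[k] and
   k->k+1 of weight S[k+1]; the path form is what admits parallel
   evaluation. Illustration only, not the paper's algorithm. */
int main(void) {
    enum { N = 8 };
    double A[N], S[N], Drec[N], Dpath[N], t = 0.0;
    srand(42);
    for (int i = 0; i < N; i++) {
        t += (double)rand() / RAND_MAX;          /* arrival times */
        A[i] = t;
        S[i] = 0.5 * rand() / RAND_MAX;          /* service times */
    }
    Drec[0] = A[0] + S[0];                       /* (a) recursion */
    for (int i = 1; i < N; i++)
        Drec[i] = dmax(A[i], Drec[i - 1]) + S[i];
    for (int i = 0; i < N; i++) {                /* (b) longest path, naive */
        double best = 0.0, suffix = 0.0;
        for (int k = i; k >= 0; k--) {
            suffix += S[k];                      /* S[k] + ... + S[i] */
            best = dmax(best, A[k] + suffix);
        }
        Dpath[i] = best;
    }
    for (int i = 0; i < N; i++)
        printf("%d: recursion=%.3f longest-path=%.3f\n", i, Drec[i], Dpath[i]);
    return 0;
}
```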

Book ChapterDOI
07 Apr 1997
TL;DR: The representations, the operators and the interpreters used in PDGP are described, and how these can be tailored to evolve RTN-based recognisers are described.
Abstract: This paper describes an application of Parallel Distributed Genetic Programming (PDGP) to the problem of inducing recognisers for natural language from positive and negative examples. PDGP is a new form of Genetic Programming (GP) which is suitable for the development of programs with a high degree of parallelism and an efficient and effective reuse of partial results. Programs are represented in PDGP as graphs with nodes representing functions and terminals, and links representing the flow of control and results. PDGP allows the exploration of a large space of possible programs including standard tree-like programs, logic networks, neural networks, finite state automata, Recursive Transition Networks (RTNs), etc. The paper describes the representations, the operators and the interpreters used in PDGP, and describes how these can be tailored to evolve RTN-based recognisers.

Patent
15 Jul 1997
TL;DR: In this article, an inherently serial program is processed in parallel, thus leading to higher processing speeds, while maintaining a close approximation to the specific result obtained through a serial running of the program.
Abstract: An inherently serial program is processed in parallel, thus leading to higher processing speeds, while maintaining a close approximation to the specific result obtained through a serial running of the program. This goal has been attained based on the fact that the desired degree of closeness between a parallel result and the serial result depends on the particular inherently serial program being run and the type of analysis being performed. That is, some inherently serial processes require a “fine-tuned” result while for others a “coarser” result is acceptable. The frequency at which the parallel branches consolidate their respective results is changed accordingly to alter the degree of closeness between the parallel processed result and the serially processed result.

Proceedings ArticleDOI
27 Mar 1997
TL;DR: A hybrid shape recognition system with an optical Hough transform preprocessor is presented; a very compact design is achieved by a microlens array processor that uses incoherent light and accepts direct optical input without any extra image converter being required.
Abstract: We present a hybrid shape recognition system with an optical Hough transform preprocessor. A very compact design is achieved by a microlens array processor. Using incoherent light, the processor accepts direct optical input without any extra image converter being required. The microlens array processor is constructed of a crossed assembly of two low-cost plastic lenticular arrays and a Hough transform weight mask. It is integrated in a compact objective barrel, which is attached directly to a CCD-camera like a conventional camera lens. The system delivers one output signal for each of the 64 X 64 microlenses. The resolution of the microlenses and the weight mask results in an extremely high degree of parallelism. It corresponds to a connection of 4k inputs and outputs by 16M weights in parallel. The feature extraction tasks of lower computational complexity and the classification, which can be performed in real-time, are implemented as a neural network on a personal computer.
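The quoted parallelism figures follow directly from the array geometry; a back-of-the-envelope check (our arithmetic, not additional data from the paper):

```latex
% 64 x 64 microlenses give 64^2 = 4096 ("4k") inputs and, with an equally
% resolved weight mask, 4096 outputs; a full interconnection therefore
% realizes 4096 x 4096 = 16,777,216 ("16M") weights applied in parallel.
64^{2} = 4096 \approx 4\,\mathrm{k},
\qquad
4096 \times 4096 = 16\,777\,216 \approx 16\,\mathrm{M}.
```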

Patent
11 Mar 1997
TL;DR: A CRC code generation circuit is presented that keeps circuit scale and operational delay small even when the number of bits in the information data subject to error detection cannot be evenly divided by the degree of parallelism of the CRC arithmetic, together with a design method for such circuits.
Abstract: PROBLEM TO BE SOLVED: To provide a CRC code generation circuit that suppresses circuit scale and operational delay even when the number of bits in the information data subject to error detection cannot be evenly divided by the degree of parallelism of the CRC arithmetic, and to provide a CRC code generation circuit design method with which such a circuit can easily be designed from an arbitrary combination of the number of information bits, the number of redundant bits and the degree of parallelism of the CRC arithmetic. SOLUTION: Suppose the number of bits in the source information data cannot be evenly divided by the degree of parallelism, 8, of the arithmetic that generates the 10-bit CRC code. With the remainder denoted h and the cycle of the generator polynomial G(X) for CRC code generation denoted n, e = (b - c + h) mod n is computed and (c - h) '0' bits are appended to the end of the information data. The result is then multiplied by X^e, the product is divided by the generator polynomial G(X), and the remainder is output as the CRC code from flip-flop circuits F0-F9.
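The arithmetic rests on ordinary polynomial division of the message by the generator G(X); the sketch below (with a hypothetical CRC-10 generator, and with the leftover h bits handled serially instead of by the patent's zero-padding and X^e correction) illustrates what a degree of parallelism of 8 means for CRC generation:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only; the polynomial and the handling of the
   leftover bits are assumptions, not the patent's circuit. CRC-10 with
   generator x^10 + x^9 + x^5 + x^4 + x + 1, i.e. low bits 0x233. */
#define POLY    0x233u
#define CRCBITS 10

/* Serial reference: one message bit per step (one LFSR clock). */
static uint16_t crc_bit(uint16_t crc, unsigned bit) {
    unsigned fb = ((crc >> (CRCBITS - 1)) & 1u) ^ (bit & 1u);
    crc = (uint16_t)((crc << 1) & ((1u << CRCBITS) - 1u));
    return fb ? (uint16_t)(crc ^ POLY) : crc;
}

/* Degree of parallelism 8: eight serial steps folded into one update.
   A hardware realization computes the same mapping as a single
   combinational function of (crc, byte) per clock. */
static uint16_t crc_byte(uint16_t crc, uint8_t byte) {
    for (int i = 7; i >= 0; i--)
        crc = crc_bit(crc, (byte >> i) & 1u);
    return crc;
}

/* When the message length is not a multiple of 8 bits, the leftover
   h bits are fed through the serial step here; the patent instead pads
   with (c - h) zeros and compensates by multiplying by X^e mod G(X). */
static uint16_t crc_message(const uint8_t *bytes, size_t nbytes,
                            const uint8_t *tailbits, unsigned h) {
    uint16_t crc = 0;
    for (size_t i = 0; i < nbytes; i++) crc = crc_byte(crc, bytes[i]);
    for (unsigned i = 0; i < h; i++)    crc = crc_bit(crc, tailbits[i]);
    return crc;
}
```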

01 Jan 1997
TL;DR: This work proposes multidimensional piecewise regular arrays: arrays of loosely connected subarrays of lower dimensionality where two different clock rates are used, and introduces a method for developing pipestructures for spreading the shared data between distinct computations and for gathering partial results in the case of a reduction operator.
Abstract: Regular arrays, particularly systolic arrays, have been the subject of continuous interest for the past 15 years. One reason is that they present an excellent example of the unity between hardware and software, especially for application-specific computations. This results in a cost-effective implementation of systolic algorithms in hardware, in VLSI chips or on FPGAs. To the present time, systolic/regular arrays have primarily been considered as 2-D structures. The chief purposes of this work are: (i) to develop methods to transform an algorithm into a form that fits the 3-D physical construction of the processor and is easy to fabricate; (ii) to find ways of increasing the available degree of parallelism and thus improve scalability and latency. For this purpose, we propose multidimensional piecewise regular arrays: arrays of loosely connected subarrays of lower dimensionality where two different clock rates are used. One, a high clock rate, is used inside subarrays, e.g. inside VLSI chips, and the other, a low clock rate, is used in the interconnection part of subarrays. These properties permit easy physical realization of large n-D arrays, as the n-D array is formed from (n-1)-D subarrays that are connected to each other only by edges using a low clock rate. Thus, 3-D arrays that consist of 2-D arrays are easily fabricated, e.g. using multichip modules, wafer-scale integration, etc. While several of the approaches that we use to achieve our aims have been considered in the literature, they have unfortunately been studied separately and without a unified approach. We combine our approach with commonly used synthesizing methods for regular arrays: with space-time transformations on polytopes. The approach we propose can be used for all associative and commutative problems. The thesis presents the synthesis of a large variety of new, higher-dimensional arrays. The two main issues involved, in addition to the existing methods in the polytope model, are: (1) in order to achieve a higher degree of parallelism, and to decrease latency, we increase the dimensionality of the source representation of the program by partitioning the range of indices; (2) we introduce a method for developing pipestructures (an extension of pipelines) for spreading the shared data between distinct computations and for gathering partial results in the case of a reduction operator. As an example, we consider template matching on systolic arrays. A 2-D mesh of linear arrays (conventional systolic arrays for 1-D convolution) that exploits two different clock rates is presented.

Journal ArticleDOI
25 Feb 1997
TL;DR: A memory-computer optoelectronic interface is defined and its functionality is analyzed with respect to its degree of parallelism, and single-instruction multiple-data processing paradigms based on such interfaces are discussed.
Abstract: The output of parallel optical memories is a two-dimensional set of data which propagates along the third dimension. Based on this characteristic we define a memory-computer optoelectronic interface and we analyze its functionality with respect to its degree of parallelism. Single-instruction multiple-data processing paradigms based on such interfaces are also discussed.

Book ChapterDOI
01 Jan 1997
TL;DR: This chapter presents a unified analysis of decomposition algorithms for continuously differentiable optimization problems defined on Cartesian products of convex feasible sets using the framework of cost approximation algorithms.
Abstract: This chapter presents a unified analysis of decomposition algorithms for continuously differentiable optimization problems defined on Cartesian products of convex feasible sets. The decomposition algorithms are analyzed using the framework of cost approximation algorithms. A convergence analysis is made for three decomposition algorithms: a sequential algorithm which encompasses the classical Gauss-Seidel scheme, a synchronized parallel algorithm which encompasses the Jacobi method, and a partially asynchronous parallel algorithm. The analysis validates inexact computations in both the subproblem and line search phases. The range of feasible step lengths within each algorithm is shown to have a direct correspondence to the increasing degree of parallelism and asynchronism, and the resulting usage of more outdated information in the algorithms.
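The two extremes are easiest to see in their classical special cases; the sketch below contrasts a Jacobi sweep (all coordinates updated from the previous iterate, hence trivially parallel) with a Gauss-Seidel sweep (each update uses the freshest values, hence sequential within a sweep) on a small linear system. This is an illustration of the underlying schemes only, not the chapter's cost approximation framework.

```c
#include <stdio.h>

/* Small diagonally dominant system A x = b with solution x = (1, 1, 1). */
enum { N = 3 };
static const double A[N][N] = {{4, 1, 0}, {1, 4, 1}, {0, 1, 4}};
static const double b[N]    = {5, 6, 5};

/* One sweep over the coordinates. For Jacobi every update reads xold,
   so the i-loop could run in parallel; Gauss-Seidel reads x directly,
   reusing values already updated in this sweep. */
static void sweep(double x[N], int gauss_seidel) {
    double xold[N];
    for (int i = 0; i < N; i++) xold[i] = x[i];
    for (int i = 0; i < N; i++) {
        double s = b[i];
        for (int j = 0; j < N; j++)
            if (j != i)
                s -= A[i][j] * (gauss_seidel ? x[j] : xold[j]);
        x[i] = s / A[i][i];
    }
}

int main(void) {
    double xj[N] = {0}, xg[N] = {0};
    for (int k = 0; k < 25; k++) { sweep(xj, 0); sweep(xg, 1); }
    printf("Jacobi:       %.6f %.6f %.6f\n", xj[0], xj[1], xj[2]);
    printf("Gauss-Seidel: %.6f %.6f %.6f\n", xg[0], xg[1], xg[2]);
    return 0;
}
```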

Book ChapterDOI
10 Sep 1997
TL;DR: Initial performance results that have been obtained using the GranSim simulator are reported, showing that a modest but useful degree of parallelism can be achieved even for a distributed-memory machine.
Abstract: Naira is a compiler for a parallel dialect of Haskell, compiling to a graph-reducing parallel abstract machine with a strong dataflow influence. Unusually (perhaps even uniquely), Naira has itself been parallelised using state-of-the-art tools developed at Glasgow and St Andrews Universities. Thus Naira is a parallel, parallelising compiler in one. This paper reports initial performance results that have been obtained using the GranSim simulator, both for the top-level pipeline and for individual compilation stages. We show that a modest but useful degree of parallelism can be achieved even for a distributed-memory machine. The simulation results have been verified on a network of distributed workstations using the GUM parallel implementation of Haskell.

Book ChapterDOI
23 Oct 1997
TL;DR: In the last few years the world of computers has evolved in two directions: conventional machines with an ever greater degree of parallelism and machines with increasingly higher Machine Intelligence Quotients (MIQs) - the element which separates these two categories is accuracy.
Abstract: In the last few years the world of computers has evolved in two directions: conventional machines with an ever greater degree of parallelism and machines with increasingly higher Machine Intelligence Quotients (MIQs) [1]. The element which separates these two categories is accuracy. The former, in fact, rely on Hard Computing (HC), i.e. computation based on mathematical accuracy, while the latter are based on Soft Computing (SC), which relies on the lower computational cost inherent in imprecision.

01 Sep 1997
TL;DR: This report develops one of the more promising options for computing the full image in high-resolution underwater acoustic imaging with multi-element arrays, in which the round-trip distance for each sensor element is computed using either a square root or a fifth-degree polynomial, the latter allowing increased parallelism.
Abstract: High-resolution underwater acoustic imaging using multi-element arrays implies a large computational load. For a three-dimensional viewing volume resolved into 3×10^9 voxels (volume pixels), with 4000 elements, the computations needed are around 9 × (1.2×10^13) floating-point operations. This report develops one of the more promising options for computing the full image. First, parallel computation is used to deal with the different sensor elements simultaneously when calculating the address of the appropriate instantaneous voltage at the sensor element, or, equivalently, the round-trip distance traveled by the acoustic pulse. This calculation requires, in a typical near-field situation, the computation either of a square root or of a fifth-degree polynomial. This polynomial allows increased parallelism. Second, the summation in the beamforming is likewise done with a high degree of parallelism. A machine with the above design, running at 10^9 clock cycles per second, would compute the entire image in roughly 6 seconds. Cost and availability are not investigated.
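The quoted operation count is consistent with one address/delay computation per voxel-element pair (our reading of the figures, not additional data from the report):

```latex
% 3e9 voxels x 4000 elements = 1.2e13 voxel-element pairs; at roughly
% 9 floating-point operations per pair the total is about 1.1e14 operations.
3\times 10^{9} \times 4000 = 1.2\times 10^{13},
\qquad
9 \times \bigl(1.2\times 10^{13}\bigr) \approx 1.1\times 10^{14}.
```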

Proceedings ArticleDOI
03 Jan 1997
TL;DR: The problem of finding a computation mapping and data distributions that minimize the number of remote data accesses for a given degree of parallelism is studied and the algorithm presented is shown to be superior to more basic mappings.
Abstract: Two important aspects have to be addressed when automatically parallelizing loop nests for massively parallel distributed memory computers, namely maximizing parallelism and minimizing communication overhead due to nonlocal data accesses. This paper studies the problem of finding a computation mapping and data distributions that minimize the number of remote data accesses for a given degree of parallelism. This problem is called the constant-degree parallelism alignment problem and is shown to be NP-hard. The algorithm presented uses a linear algebra framework and assumes affine data access functions. It proceeds by enumerating all interesting bases of the set of vectors representing the alignments between computation and data accesses that should be satisfied. It is shown in a comparison with related work how the approach presented allows one to express previous results as special cases. The algorithm is applied to benchmark programs and is shown to be superior to more basic mappings.

Journal ArticleDOI
TL;DR: This parallel algorithm is based on a numerically stable method that transforms the matrices of the system to block Hessenberg form; it requires only one-sided orthogonal transformations, and usual data layouts such as 1-D and 2-D block column/row wrap are appropriate.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the relationship between the information bandwidth, the optical power efficiency, and the degree of parallelism for optical interconnection architectures that employ optical fan-in.
Abstract: We present an investigation of relationships among the information bandwidth, the optical power efficiency, and the degree of parallelism for optical interconnection architectures that employ optical fan-in. The foundation for these relationships is the Lagrange invariant, or, more specifically, the constant-radiance theorem. We show that, when restrictions imposed by the constant-radiance theorem are combined with requirements on the probability of error, an upper limit is placed on the bandwidth that is reduced as the fan-in ratio increases. These limitations are significantly more severe when optical fan-in is used to perform analog summations. We then define a measure of processing efficiency that takes into account the influence of optical input power on the probability of error and is used to interpret the results.

Journal ArticleDOI
01 Dec 1997
TL;DR: Techniques for improving server capacity by assigning media requests to the nodes of a server so as to balance the load on the interconnection network and the scheduling nodes are presented.
Abstract: A server for an interactive distributed multimedia system may require thousands of gigabytes of storage space and high I/O bandwidth. In order to maximize system utilization, and thus minimize cost, the load must be balanced among the server's disks, interconnection network and scheduler. Many algorithms for maximizing retrieval capacity from the storage system have been proposed. This paper presents techniques for improving server capacity by assigning media requests to the nodes of a server so as to balance the load on the interconnection network and the scheduling nodes. Five policies for dynamic request assignment are developed. An important factor that affects data retrieval in a high-performance continuous media server is the degree of parallelism of data retrieval. The performance of the dynamic policies on an implementation of a server model developed earlier is presented for two values of the degree of parallelism.

01 Jan 1997
TL;DR: This thesis outlines a cost-effective multiprocessor architecture that takes into consideration the importance of system costs as well as delivered performance, and establishes that HPAM machines can have higher cost-efficiency than the optimal homogeneous machine for a given application.
Abstract: This thesis outlines a cost-effective multiprocessor architecture that takes into consideration the importance of system costs as well as delivered performance. The proposed architecture, HPAM, is organized as a Hierarchy of Processor-And-Memory homogeneous subsystems. Across the levels of the hierarchy, processor speeds and interconnection technology vary. The proposed multilevel processor configuration uses fast and costly resources sparingly to reduce sequential and low parallelism bottlenecks. The resulting organization tries to balance cost, speed and parallelism granularity. Two temporal (instruction and data) locality principles with respect to the degree of parallelism are identified and empirically established for a set of programs. These principles suggest the desirability of a hierarchical approach to cost-effective high-performance computing. In order to conduct detailed analysis of the different features of HPAM machines, a simulator was developed. This simulator (HPAM_Sim) allows the simulation of target machines consisting of different processors and interconnection networks in either contention or non-contention modes. Using HPAM_Sim, a simulation-based study of the performance achieved by mapping compiler and hand-parallelized versions of the CMU benchmarks onto different HPAM machines was conducted. This study establishes that (1) HPAM machines can have higher cost-efficiency than the optimal homogeneous machine for a given application; (2) HPAM machines can benefit from hardware and software support for reconfiguring three-level and two-level machines into two-level and one-level organizations; (3) the performance of a given application, when executed on an HPAM machine, is dictated not only by the degree of parallelism but also the ratio of total communication time to total computation time of the application; and (4) efficient implementations of collective communication operations can improve the performance of HPAM machines. This last fact led to the study of three important collective communication operations, namely broadcast, scatter and gather, in the context of an HPAM machine.

Book ChapterDOI
01 Sep 1997
TL;DR: It is shown how overall system performance can be increased by appropriate data distribution, and the system performance is further enhanced by avoiding unnecessary data transfers.
Abstract: The paper presents a general data and task scheduling technique for parallel accelerators with one or more processing modules and the capability for local and shared memory access. Multiple tasks and their data are mapped onto the processing modules and the host, providing a high degree of parallelism. The system performance is enhanced by avoiding unnecessary data transfers. It is shown how overall system performance can be increased by appropriate data distribution.

Book ChapterDOI
07 Aug 1997
TL;DR: A detailed case study of programming in a parallel programming system which targets complete and controlled parallelization of array-oriented computations to demonstrate how coherent integration of control and data parallelism enables both effective realization of the potential parallelism of applications and matching of the degree of parallelism in a program to the resources of the execution environment.
Abstract: This paper presents a detailed case study of programming in a parallel programming system which targets complete and controlled parallelization of array-oriented computations. The purpose is to demonstrate how coherent integration of control and data parallelism enables both effective realization of the potential parallelism of applications and matching of the degree of parallelism in a program to the resources of the execution environment. (“Our ability to reason is constrained by the language in which we reason.”) The programming system is based on an integrated graphical, declarative representation of control parallelism and data partitioning parallelism. The computation used for the example is even-odd reduction of block tridiagonal matrices. This computation has three phases, each with a different parallel structure. We derive, implement and measure the execution of a dynamic parallel computation structure which employs differing levels of control and data parallelism in each phase of the computation to give load balanced execution across a range of number of processors. The program formulated in the integrated representation revealed parallelism not shown in the original algorithm and has a constant level of actual parallelism throughout the computation where the original algorithm had unbalanced levels of parallelism during different phases of the computation. The resulting program shows near-linear speed-up across all phases of the computation for number of processors ranging from 2 to 32.

Proceedings ArticleDOI
A. Elahi
12 Apr 1997
TL;DR: The architecture selection factors described can certainly help in improving the I/O and processor-bound limitations; neglecting them can degrade system performance in terms of the real-time requirement, quality of presentation and future expansion.
Abstract: Some application-specific algorithms exhibit very high processing demand due to the real-time requirement. Architectural structure adaptations are necessary to meet the computational requirement of such algorithms. An optimized architecture should consider all the performance criteria and their degree of importance. Among these, the degree of parallelism, memory and communication bandwidth, and I/O rate should be considered. A functional analysis of several image processing algorithms is used as an example to find a parallel architecture that meets the requirements. The architecture selection factors described can certainly help in improving the I/O and processor-bound limitations. Their lack of consideration can degrade the system performance in terms of the real-time requirement, quality of presentation and future expansion.