
Showing papers on "Degree of parallelism published in 1993"


Proceedings ArticleDOI
01 May 1993
TL;DR: The goal is to quantify the floating point, memory, I/O and communication requirements of highly parallel scientific applications that perform explicit communication and develop analytical models for the effects of changing the problem size and the degree of parallelism.
Abstract: This paper studies the behavior of scientific applications running on distributed memory parallel computers. Our goal is to quantify the floating point, memory, I/O and communication requirements of highly parallel scientific applications that perform explicit communication. In addition to quantifying these requirements for fixed problem sizes and numbers of processors, we develop analytical models for the effects of changing the problem size and the degree of parallelism for several of the applications. We use the results to evaluate the trade-offs in the design of multicomputer architectures.

141 citations
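The paper's analytical models are not reproduced in the abstract, but the general shape of such a model can be sketched in Python. The cost constants and the boundary-exchange communication term below are illustrative assumptions, not figures from the paper:

```python
# Hypothetical analytical model of the kind the paper develops: predicted
# per-processor time as problem size n and processor count p vary.
# The constants a, b, c are illustrative, not taken from the paper.

def predicted_time(n, p, a=1.0, b=0.1, c=0.01):
    """Compute + communication time for an n x n grid on p processors.

    a: flop cost per grid point, b: per-word transfer cost,
    c: per-message startup cost (all in hypothetical units).
    """
    compute = a * n * n / p            # work split evenly over p processors
    # assume each processor exchanges a subgrid boundary of ~n/sqrt(p) words
    comm = b * n / (p ** 0.5) + c
    return compute + comm

def speedup(n, p):
    return predicted_time(n, 1) / predicted_time(n, p)
```

Under these assumed constants, speedup is near-linear for large problems, with the communication term eroding it as p grows.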


Proceedings ArticleDOI
01 Dec 1993
TL;DR: A parallel algorithm for constructing the Delaunay triangulation of a set of vertices in three-dimensional space is presented, which achieves a fast running time and good scalability over a wide range of problem sizes and machine sizes.
Abstract: A parallel algorithm for constructing the Delaunay triangulation of a set of vertices in three-dimensional space is presented. The algorithm achieves a high degree of parallelism by starting the construction from every vertex and expanding over all open faces thereafter. In the expansion of open faces, the search is made faster by using a bucketing technique. The algorithm is designed under a data-parallel paradigm. It uses segmented list structures and virtual processing for load-balancing. As a result, the algorithm achieves a fast running time and good scalability over a wide range of problem sizes and machine sizes. A topological check is incorporated to eliminate inconsistencies due to degeneracies and numerical errors. The algorithm is implemented on Connection Machines CM-2 and CM-5, and experimental results are presented.

29 citations
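The bucketing technique itself is not specified in the abstract; a minimal sketch of the general idea, with our own class and method names, hashes vertices into a uniform 3-D grid so that a search for candidate vertices visits only nearby cells instead of the whole point set:

```python
from collections import defaultdict
from math import floor

# Illustrative bucketing sketch (names are ours, not the paper's):
# vertices are hashed into a uniform 3-D grid of cells.

class BucketGrid:
    def __init__(self, cell_size):
        self.cell = cell_size
        self.buckets = defaultdict(list)

    def _key(self, p):
        # integer cell coordinates of point p
        return tuple(floor(c / self.cell) for c in p)

    def insert(self, p):
        self.buckets[self._key(p)].append(p)

    def near(self, p):
        """Vertices in the cell of p and the 26 neighbouring cells."""
        kx, ky, kz = self._key(p)
        out = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    out.extend(self.buckets.get((kx + dx, ky + dy, kz + dz), []))
        return out
```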


Journal ArticleDOI
01 Mar 1993
TL;DR: It turns out that several geometric feature extraction algorithms such as moment generation can be represented in this scheme so that the inherent information growth of the algorithms enables the exploitation of bit-level concurrency in the architectural design.
Abstract: In computer vision and image processing, the high degree of parallelism and pipelining of algorithms is often obstructed by the raster-scan I/O constraint and the information-growing property of multiresolution structures. The approach of formulating algorithms in the pyramid structure as a binary tree structure, and mapping the binary tree structure into a linear pipelined array of 2logN levels for N*N images using a first-in, first-out (FIFO) technique to emulate the tree connections, is proposed. It turns out that several geometric feature extraction algorithms such as moment generation can be represented in this scheme, so that the inherent information growth of the algorithms enables the exploitation of bit-level concurrency in the architectural design. Consequently, the design of the pipelined processor at each level is significantly simplified using bit-serial arithmetic, and this VLSI architecture is capable of generating moments concurrently in real time.

21 citations
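The FIFO emulation of the tree connections can be illustrated in software. The sketch below is our own construction, not the paper's bit-serial design: it reduces a stream of N = 2^levels values through a linear chain of stages, each stage buffering one pending operand in a FIFO and combining it with the next arrival:

```python
from collections import deque

# A binary reduction tree of depth log2(N) emulated by a linear pipeline:
# each stage holds at most one waiting operand in its FIFO.

def pipelined_reduce(stream, op, levels):
    """Push items through `levels` stages; N = 2**levels inputs
    yield one combined result per full group."""
    fifos = [deque() for _ in range(levels)]
    out = []
    for item in stream:
        for f in fifos:
            if not f:
                f.append(item)            # first operand waits in the FIFO
                item = None
                break
            item = op(f.popleft(), item)  # combine with waiting operand
        if item is not None:              # passed through all levels
            out.append(item)
    return out
```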


Book ChapterDOI
01 Jan 1993
TL;DR: This paper presents a survey of high-performance switch fabric architectures which incorporate fast packet switching as their underlying switching technique to handle various traffic types.
Abstract: The rapid evolution in the field of telecommunications has led to the emergence of new switching technologies to support a variety of communication services with a wide range of transmission rates in a common, unified integrated services network. At the same time, the progress in the field of VLSI technology has brought about new design principles for high-performance, high-capacity switching fabrics to be used in the integrated networks of the future. Most of the recent proposals for such high-performance switching fabrics have been based on a principle known as fast packet switching. This principle employs a high degree of parallelism, distributed control, and routing performed at the hardware level. In this paper, we present a survey of high-performance switch fabric architectures which incorporate fast packet switching as their underlying switching technique to handle various traffic types. Our intention is to give a descriptive overview of the major activities in this rapidly evolving field of telecommunications.

20 citations


Book ChapterDOI
30 Aug 1993
TL;DR: The present work generalizes the static treatment of Aceto to full CCS, and produces a distributed semantics which yields finite transition systems for all CCS processes with a regular behaviour and a finite degree of parallelism.
Abstract: The distributed structure of CCS processes can be made explicit by assigning different locations to their parallel components. These locations then become part of what is observed of a process. The assignment of locations may be done statically, or dynamically as the execution proceeds. The dynamic approach was developed first, by Boudol et al. in [BCHK91a], [BCHK91b], as it seemed more convenient for defining notions of location equivalence and preorder. However, it has the drawback of yielding infinite transition system representations. The static approach, which is more intuitive but technically more elaborate, was later developed by L. Aceto [Ace91] for nets of automata, a subset of CCS where parallelism is only allowed at the top level. In this approach each net of automata has a finite representation, and one may derive notions of equivalence and preorder which coincide with the dynamic ones. The present work generalizes the static treatment of Aceto to full CCS. The result is a distributed semantics which yields finite transition systems for all CCS processes with a regular behaviour and a finite degree of parallelism.

16 citations


Proceedings ArticleDOI
01 Sep 1993
TL;DR: The paper discusses the value of abstraction and semantic richness, performance issues, portability, potential degree of parallelism, data distribution, process creation, communication and synchronization, frequency of program faults, and clarity of expression, using the BLAS as a basis for comparison with traditional supercomputing languages.
Abstract: Although multicomputers are becoming feasible for solving large problems, they are difficult to program: extraction of parallelism from scalar languages is possible but limited; parallelism in algorithm design is difficult for those who think in von Neumann terms; and portability of programs and programming skills can only be achieved by hiding the underlying machine architecture from the user, yet this may impact performance on a specific host. APL, J, and other applicative array languages with adequately rich semantics can do much to solve these problems. The paper discusses the value of abstraction and semantic richness, performance issues, portability, potential degree of parallelism, data distribution, process creation, communication and synchronization, frequency of program faults, and clarity of expression. The BLAS are used as a basis for comparison with traditional supercomputing languages.

13 citations


Proceedings ArticleDOI
01 Dec 1993
TL;DR: Two algorithms developed for a distributed, discrete-event, object-oriented traffic simulation, the Traffic and Highway Objects for REsearch, Analysis, and Understanding (THOREAU), are presented.
Abstract: This paper presents two algorithms developed for a distributed, discrete-event, object-oriented traffic simulation, the Traffic and Highway Objects for REsearch, Analysis, and Understanding (THOREAU) (McGurrin and Wang, 1991; Hsin and Wang, 1992). THOREAU was designed for the study and analysis of Intelligent Vehicle Highway Systems (IVHS) [1] applications. The purpose of using distributed processing for traffic simulation is to extend the scope of what can be modeled at an individual vehicle behavior level, by significantly increasing execution speed. The first algorithm was derived to decompose a large traffic model into submodels distributed over a network of workstations, with a minimum amount of inter-processor interaction, so as to achieve the highest degree of parallelism. The second algorithm is an improvement of the Floyd algorithm for finding shortest paths, using submodel decomposition and node-to-arc incidence to achieve a 10m³-fold speed improvement using m distributed processors. Both algorithms are being implemented for IVHS-related applications in a new version of THOREAU.

8 citations
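The improved distributed algorithm is not detailed in the abstract; for reference, the sequential Floyd algorithm it starts from is the standard all-pairs shortest-path recurrence:

```python
# Baseline Floyd (Floyd-Warshall) algorithm that the paper's second
# algorithm improves on: all-pairs shortest paths over a distance matrix.

INF = float("inf")

def floyd(dist):
    """dist: square matrix (list of lists); returns the shortest-path
    distance matrix without modifying the input."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):                 # allow k as an intermediate node
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```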


Proceedings ArticleDOI
22 Sep 1993
TL;DR: The authors propose a method for deriving parallel, scheduling-optimized protocol implementations from sequential protocol specifications by starting with an SDL specification, identifying a common path for optimization, and performing a data dependency analysis.
Abstract: The authors propose a method for deriving parallel, scheduling-optimized protocol implementations from sequential protocol specifications. They start with an SDL specification, identify a common path for optimization, and perform a data dependency analysis. The resulting common path graph is parallelized as far as permitted by the data dependency graph. The degree of parallelism is extended even further by deferring data operations to the exit nodes of the common path graph. The resulting parallel operation model is then submitted to a scheduling algorithm, yielding an optimized compile-time schedule. An IP-based protocol stack with TCP and FTP as upper layers serves as an example.

8 citations
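The parallelization step described above — running operations concurrently as far as the data-dependency graph permits — can be sketched as level scheduling. The graph encoding (a dict from operation to its prerequisite set) is our own illustration, not the paper's SDL-derived representation:

```python
# Group operations into dependency levels: operations in the same level
# have no dependencies among themselves and may execute in parallel.

def schedule_levels(deps):
    """deps: dict mapping op -> set of prerequisite ops."""
    done, levels = set(), []
    pending = dict(deps)
    while pending:
        ready = [op for op, pre in pending.items() if pre <= done]
        if not ready:
            raise ValueError("cyclic dependency graph")
        levels.append(sorted(ready))   # one parallel step
        done.update(ready)
        for op in ready:
            del pending[op]
    return levels
```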


Proceedings ArticleDOI
17 Oct 1993
TL;DR: In this work, it is shown how a simple arrangement of FPGAs and memory can be used to synthesize a wide variety of image processing pipelines having different topologies and functionality.
Abstract: This paper discusses the use of restructurable hardware, specifically field programmable gate arrays, in real-time image processing and manipulation tasks such as convolution filtering, scaling and rotation, composition, color space transformation, etc. Each of these functions can be implemented using a customized pipeline design to obtain a high degree of parallelism and thus high performance. In this work, we show how a simple arrangement of FPGAs and memory can be used to synthesize a wide variety of image processing pipelines having different topologies and functionality.

8 citations
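As a toy software analogue of the reconfigurable pipelines described above, a pipeline can be modelled as an ordered list of stages, with "reprogramming the FPGA" corresponding to swapping the stage list. The stages here are simplified stand-ins, not the paper's hardware designs:

```python
# A pipeline is just a composition of stages over an image; swapping the
# stage list gives a pipeline with different topology and functionality.

def make_pipeline(*stages):
    def run(image):
        for stage in stages:
            image = stage(image)
        return image
    return run

# Illustrative stages over a 1-D "image" of pixel values:
scale = lambda img: [2 * p for p in img]        # stand-in for scaling
clamp = lambda img: [min(p, 255) for p in img]  # stand-in for range limiting

pipe = make_pipeline(scale, clamp)
```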


Proceedings ArticleDOI
01 Jun 1993
TL;DR: The aim of demonstrating VODAK Open Nested Transactions is to provide insights into internals of database systems that are usually hidden from application programmers and users, and to increase the degree of parallelism between concurrent transactions compared to conventional transaction management schemes.
Abstract: VODAK is a prototype of an object-oriented, distributed database system developed during the past five years at the Integrated Publication and Information Systems Institute (IPSI). The aim of demonstrating VODAK Open Nested Transactions is to provide insights into internals of database systems that are usually hidden from application programmers and users. By utilizing semantics of methods, VODAK Open Nested Transactions increase the degree of parallelism between concurrent transactions compared to conventional transaction management schemes. Demonstrating the difference in parallelism provides users with a “feeling” for internal database mechanisms, application programmers with information about the impact of transaction management on performance, and system developers with ideas how to improve their systems with respect to transaction management.

6 citations
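The abstract's point — that method semantics admit more concurrency than conventional read/write locking — can be illustrated with a commutativity table: two invocations may run concurrently if they commute. The table below is a hypothetical example for a counter object, not VODAK's actual rule set:

```python
# Two increments commute (final counter value is the same either way),
# so they may run in parallel; a read does not commute with an increment.

COMMUTES = {
    ("increment", "increment"): True,   # order does not matter
    ("increment", "read"): False,       # read would see different values
    ("read", "read"): True,
}

def may_run_concurrently(m1, m2):
    key = tuple(sorted((m1, m2)))       # table is symmetric
    return COMMUTES.get(key, False)     # default: conservative, serialize
```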


Journal ArticleDOI
TL;DR: The stages in reconfiguring a computation for parallel execution are described and three novel and useful techniques are presented for analyzing and restructuring programs for execution on varying parallel computer architectures.

Proceedings ArticleDOI
J.N. Coleman1
13 Apr 1993
TL;DR: It is shown that the dataflow processor compares favourably, given a reasonable degree of parallelism in the program, with the measured performance of an advanced von Neumann computer running equivalent code.
Abstract: The Event Processor/3 is a dataflow processing element designed for high performance over a range of general computing tasks. Using a multithreading technique, program parallelism is exploited by interleaving threads onto successive pipeline stages. It may also be used as an element in a multiprocessor system. This paper describes the philosophy and design of the machine, and presents the results of detailed simulations of the performance of a single processing element. The performance is analysed into three factors: clock period, cycles per instruction, and instructions per program; each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the dataflow processor compares favourably, given a reasonable degree of parallelism in the program.

Patent
10 Dec 1993
TL;DR: The degree of parallelism is improved by starting each movement-compensated computing element at a prescribed time to read the data necessary for norm operations from plural parts of the memory group; with a memory group having P reading ports, a maximum parallelism of (P-1)·m is achieved.
Abstract: PURPOSE: To improve the degree of parallelism by starting each movement-compensated computing element at a prescribed time to read out the data necessary for norm operations from plural parts of a memory group. CONSTITUTION: Picture element data for m lines in the vertical direction are successively read out from the 1st port out of P ports in a searching picture element data memory group 105 while scanning them, using the uppermost-left picture element in the searching picture element area as a start point; data are then similarly read out from the 2nd port, using a picture element shifted by m lines in the vertical direction as a start point. In a movement-compensated computing element 107, m×m movement-compensated elements 103 each select one port in the memory group 105 and are started at the timing for reading out the data necessary for norm operations from that port. Consequently, the operation can be executed with a maximum parallelism of (P-1)·m by using the memory group 105 having P reading ports.

Proceedings ArticleDOI
15 Nov 1993
TL;DR: This paper describes an optoelectronic look-up table configuration based on an array of exclusive-or gates implemented with heterostructure phototransistors (HPT) and vertical cavity surface emitting lasers (VCSEL).
Abstract: Look-up tables, also known as truth tables, have been commonly used in several processing tasks, such as database operations, associative processing, residue arithmetic, cache memories, mathematical function modules, and instruction decoding. In its simplest form, a look-up table contains a list of values that must be compared against an input argument. If the input value is found among the table entries, a match signal is generated. Many variations of this scheme exist, such as input/output tabulation, multiple and/or partial matching capability, content addressable memories, and auto- and heteroassociative memories. The main function of searching the look-up table is a highly parallel process and must be completed, preferably, in a single step. The low computational requirements of the look-up operation, coupled with its large degree of parallelism and tabular representation of data, have led to several implementations of optical look-up table architectures. In this paper, we describe an optoelectronic look-up table configuration based on an array of exclusive-or gates implemented with heterostructure phototransistors (HPT) and vertical cavity surface emitting lasers (VCSEL).

Proceedings ArticleDOI
01 Jan 1993
TL;DR: A finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer and a high degree of parallelism has been achieved utilizing a MasPar MP-2 SIMD computer.
Abstract: A finite element model is developed and used to simulate three-dimensional compressible fluid flow on a massively parallel computer. The algorithm is based on a Petrov-Galerkin weighting of the convective terms in the governing equations. The discretized time-dependent equations are solved explicitly using a second-order Runge-Kutta scheme. A high degree of parallelism has been achieved utilizing a MasPar MP-2 SIMD computer. An automated conversion program is used to translate the original FORTRAN 77 code into the FORTRAN 90 needed for parallelization. This conversion program and the use of compiler directives allow the maintenance of one version of the code for use on either vector or parallel machines. The performance of the algorithm is illustrated through its application to several example problems; execution times are presented for different computational platforms.

Book ChapterDOI
13 Sep 1993
TL;DR: An autoassociator neural network that can be operated to solve a computation problem with a high degree of parallelism is described; the set of stable states that build up the attractor is determined by the feedback weights and biases.
Abstract: An autoassociator neural network can be operated to solve a computation problem with a high degree of parallelism. The set of stable states (solutions of the problem) that build up the attractor of such a recurrent network is determined by the feedback weights and biases. This set can be constructed by using the k-out-of-n design rule. It is shown how to convert arbitrary boolean expressions into a list of k-out-of-n constraints. Finally, a compiler for generating the network structure is described.
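The k-out-of-n design rule mentioned above can be sketched directly: a binary state satisfies a k-out-of-n constraint when exactly k of its n designated units are active, and the set of satisfying states is the attractor the network is built to have. The enumeration below is our illustration, not the paper's compiler:

```python
from itertools import product

# A k-out-of-n constraint over n binary units, and brute-force
# enumeration of the states satisfying it (the intended stable states).

def k_out_of_n(k, bits):
    return sum(bits) == k

def stable_states(k, n):
    return [s for s in product((0, 1), repeat=n) if k_out_of_n(k, s)]
```

For example, 1-out-of-3 yields the three one-hot states, which is how a constraint like "exactly one of these alternatives holds" is encoded.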

Journal ArticleDOI
TL;DR: The main feature of the proposed architectures lies in elaborating each subsystem's decision not only by processing its own local data, but also by adjusting this decision with all other related subsystems' local data, to ensure the optimality of the decentralized filter.
Abstract: This paper presents decentralized computational architectures for optimal state estimation in stochastic large-scale linear systems. The main feature of the proposed architectures lies in elaborating each subsystem's decision not only by processing its own local data, but also by adjusting this decision with all other related subsystems' local data. This adjustment procedure ensures the optimality of the decentralized filter. It is emphasized that the Kalman filter algorithm operates more efficiently when measurements are processed in low-order subsets, especially when they are processed one at a time. Thus, using this feature in a decentralized scheme significantly increases computational savings and numerical stability. The architectures presented in this paper for the mechanization of decentralized estimators allow a high degree of parallelism and can be implemented on a wide range of computer networks.
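The observation that the Kalman filter is most efficient when measurements are processed one at a time can be shown in the scalar-state case. The sketch below applies scalar measurement updates sequentially; the state x with variance P and the measurement triples (z, h, r) are our illustrative setup, not the paper's formulation:

```python
# Sequential scalar Kalman measurement updates for a scalar state x with
# variance P. Each measurement is z = h*x + noise with noise variance r;
# processing them one at a time avoids any matrix inversion.

def sequential_update(x, P, measurements):
    for z, h, r in measurements:
        K = P * h / (h * h * P + r)   # scalar Kalman gain
        x = x + K * (z - h * x)       # state correction
        P = (1 - K * h) * P           # variance reduction
    return x, P
```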

01 Jan 1993
TL;DR: Several parallel, multi-wavefront algorithms based on two processing approaches, i.e., identification and elimination approaches, to verify association patterns specified in queries are presented, thus introducing a higher degree of parallelism in query processing.
Abstract: Object-oriented database management systems (OODBMSs) provide rich facilities for the modeling and processing of structural as well as behavioral properties of complex application objects. However, due to their inherent generality and continuously evolving functionalities, efficient implementations are important for these OODBMSs to support the present and future applications, particularly when the databases are very large. In this dissertation, we present several parallel, multi-wavefront algorithms based on two processing approaches, i.e., identification and elimination approaches, to verify association patterns specified in queries. Both approaches allow more processors to operate concurrently on a query than the traditional tree-structured query processing approach, thus introducing a higher degree of parallelism in query processing. A graph model is used to transform the query processing problem into a graph problem. Based on the graph model, proofs of correctness of both approaches for tree-structured queries are given, and a combined approach for solving cyclic queries is also provided and proved. A heuristic method is also presented for partitioning an OODB. The main consideration for partitioning the database is load balancing. This method also tries to reduce the communication time by reducing the length of the path that wavefronts need to be propagated. Multiple wavefront algorithms based on the two approaches for tree-structured queries have been implemented on an nCUBE 2 parallel computer. The implementation of the query processor allows multiple queries to be executed simultaneously. This implementation provides an environment for evaluating the algorithms and the heuristic method for partitioning the database. The evaluation results are presented in this dissertation.

Proceedings ArticleDOI
28 Mar 1993
TL;DR: The algorithm discussed allows the building of hierarchical selective fuzzy systems and can be implemented quite easily in hardware, for example as a rule chip able to execute a large number of rules.
Abstract: In both crisp and fuzzy inference machines, the degree of parallelism required to yield one complete elementary inference, i.e., an inference for one node and one output variable, in one processing step is defined as the dimension of the inference. It is shown that the complexity of the hardware and the complexity of the computation can be substantially decreased by using selective activation of the inference rules. The algorithm discussed allows the building of hierarchical selective fuzzy systems. The algorithm for selective rule activation is presented for the one-dimensional input space case, i.e., for a single input variable. The algorithm can be implemented quite easily in hardware, for example as a rule chip able to execute a large number of rules.

Book ChapterDOI
01 Jan 1993
TL;DR: The Integrated Channel Manager (ICM), an architecture for fast adaptive channel allocation, is an integrated controller connected to the system bus within the network switch that allows an efficient rejection of a call when the call cannot be supported.
Abstract: This paper proposes a hardware solution for the efficient utilization of cellular networks with single- and multi-terminal platforms. In such networks, a mobile platform (e.g., an airplane) can carry more than one wireless terminal. Good utilization of the available channels as a shared resource is important for quality and efficient communications in the network. In this paper, we propose the Integrated Channel Manager (ICM), an architecture for fast adaptive channel allocation. It is an integrated controller connected to the system bus within the network switch. Its main advantage is fast allocation of available channels when a request for a call initialization or a hand-off arrives. Its efficiency is achieved via channel allocation functions supported by hardware with a high degree of parallelism. The ICM supports both single and multiple hand-offs. It allows efficient rejection of a call when the call cannot be supported, thus reducing the processing overhead for rejected calls.

Proceedings ArticleDOI
13 Apr 1993
TL;DR: The authors chose FP as the application language because it is a simple yet expressive language and because FP allows one to create functional forms that yield highly-parallel computations when applied to lists representing matrix or vector data.
Abstract: This paper describes an implementation scheme that maps sequences (lists) in the functional language FP onto a data-parallel SIMD multiprocessor. The mapping is dynamic (i.e., self-organizing at run-time via an atom vector) and is transparent to the programmer. Furthermore, as the problem size and the capability of the architecture increase, the method described will proportionally scale the degree of parallelism. The authors chose FP as the application language because it is a simple yet expressive language and because FP allows one to create functional forms that yield highly-parallel computations when applied to lists representing matrix or vector data. The target architecture is a MasPar MP-1 with 16K processors.
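The atom-vector mechanism itself is not given in the abstract; a minimal sketch of the underlying idea — distributing list atoms over virtual processors and applying a function to every atom in one data-parallel step — might look like this (the names and round-robin layout are our assumptions):

```python
# Simulate a SIMD mapping of an FP sequence: atoms are assigned to
# virtual processors, and an "apply-to-all" is one parallel step in
# which every processor maps the function over its local atoms.

def distribute(atoms, n_proc):
    """Round-robin assignment of atoms to n_proc virtual processors."""
    pes = [[] for _ in range(n_proc)]
    for i, a in enumerate(atoms):
        pes[i % n_proc].append(a)
    return pes

def apply_to_all(f, pes):
    """One data-parallel step: every PE applies f to its own atoms."""
    return [[f(a) for a in local] for local in pes]
```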

Book ChapterDOI
01 Jan 1993
TL;DR: Results are presented for an opto-electronic broadcasting network, which has been evaluated in terms of the possible degree of parallelism, the energy requirements, and the speed of the system.
Abstract: In order to design the very complex systems which occur in optical or optoelectronic interconnection and processing systems computer aided design tools are necessary. There are two main approaches to the design of such systems. One approach emphasizes a hybrid concept, known as smart pixels, in which communication is performed optically and processing is performed electronically. The other approach, known as symbolic substitution logic, tries to eliminate the electronics as far as possible. HADLOP (Hardware Description Logic for Optical Processing) is a software design tool for the modelling, simulation and evaluation of both approaches. In contrast to hardware description languages for pure electronic designs, with HADLOP it is possible to model the two-dimensional nature of optics. HADLOP works at the gate level because systems are described as a sequence of two-dimensional gate layers which are connected with optical connection modules. We present results for an opto-electronic broadcasting network, which has been evaluated in terms of the possible degree of parallelism, the energy requirements, and the speed of the system.

Proceedings ArticleDOI
01 Nov 1993
TL;DR: An approach to implement several time-adaptive Jacobi-type algorithms on a parallel processor array, using only Cordic arithmetic and asynchronous communications, such that any degree of parallelism, ranging from single-processor up to full-size array implementation, is supported by a `universal' processing unit.
Abstract: Implementing Jacobi algorithms in parallel VLSI processor arrays is a non-trivial task, in particular when the algorithms are parametrized with respect to size and the architectures are parametrized with respect to space-time trade-offs. The paper is concerned with an approach to implementing several time-adaptive Jacobi-type algorithms on a parallel processor array, using only Cordic arithmetic and asynchronous communications, such that any degree of parallelism, ranging from single-processor up to full-size array implementation, is supported by a 'universal' processing unit. This result is attributed to a gracious interplay between algorithmic and architectural engineering.

Proceedings ArticleDOI
05 Jan 1993
TL;DR: A fully automatic compilation method for distributed memory machines that produces an efficient program partition without user intervention is demonstrated by targeting Sisal, a parallel functional language, on an iPSC/860 multicomputer.
Abstract: A fully automatic compilation method for distributed memory machines is described. It produces an efficient program partition without user intervention. A task-based approach is adopted at the intermediate-form level, allowing a large degree of language and architecture independence. The scheduling phase of the compiler works partially at compile time and partially at run time. At compile time, an infinite number of processors is assumed and the schedule is found by using a new concept, the threshold of a task, which quantifies a tradeoff between the schedule length and the degree of parallelism. At run time, the generated parallel code can be scaled down to match the available processors. This approach is demonstrated by targeting Sisal, a parallel functional language, on an iPSC/860 multicomputer.
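The compile-time/run-time split described above can be sketched with a simple cost model: suppose the compile-time schedule groups tasks into dependency levels assuming unbounded processors; scaling down to p processors then costs ceil(width/p) steps per level. This is our own illustration of the scale-down step; the paper's task-threshold notion is not modelled here:

```python
from math import ceil

# Run-time cost of a compile-time schedule when folded onto p processors:
# each level of w independent tasks needs ceil(w / p) sequential steps.

def makespan(level_widths, p):
    return sum(ceil(w / p) for w in level_widths)
```

With unbounded processors the makespan equals the number of levels (the critical-path length); with fewer processors the wide levels stretch out.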

Proceedings ArticleDOI
27 May 1993
TL;DR: An allocation algorithm is developed that, in conjunction with optimal scheduling algorithms, yields the minimum time cost for a parallel structure on a number of processors smaller than its degree of parallelism.
Abstract: We develop an allocation algorithm that, in conjunction with optimal scheduling algorithms, yields the minimum time cost for a parallel structure. We assume a shared-memory environment and a limited number of processors, smaller than the degree of parallelism of the parallel structure.

Proceedings ArticleDOI
13 Apr 1993
TL;DR: The paper explores the trade-offs between block-row and block-column partitioning schemes for the matrix of constraint coefficients vis-a-vis the communication overheads and granularity of parallel computations.
Abstract: The parallel implementation of the revised simplex algorithm (RSA) using eta-factorization holds the promise of significant improvement in execution time, by virtue of the high degree of parallelism in the computation within an iteration of the algorithm. However, the scheme employed to partition key data structures in a distributed memory parallel processor has a great impact on the achievable performance. The paper explores the trade-offs between block-row and block-column partitioning schemes for the matrix of constraint coefficients vis-a-vis the communication overheads and granularity of parallel computations. The results of an approximate analysis of the compute-communication balance are compared with measurements from a practical implementation of the partitioning schemes on C-DAC's PARAM 8000 distributed memory parallel processor.

01 Jan 1993
TL;DR: A new language combining formal description features with those of the object approach is developed, and the interest of High-level Petri Nets as a semantics for such a language is pointed out, since they provide numerous verification algorithms.
Abstract: The components of many parallel applications (processes, resources, communication links, ...) can be grouped into classes of objects with a similar behaviour. These applications are often described in a generic fashion by specifying the behaviour of an item of each class independently of its cardinality. This paper handles specification, verification and prototyping methods for these applications. First, we outline the drawbacks of current specification languages and develop a new language combining formal description features with those of the object approach. Then we point out the interest of High-level Petri Nets as a semantics for such a language, since they provide numerous verification algorithms. Finally, we overview the prototyping techniques based on High-level Petri Nets which aim at obtaining a maximal degree of parallelism and fulfilling the requirements of a target architecture.

Book ChapterDOI
01 Jan 1993
TL;DR: Highly parallel computing machines can vary over a wide range of design philosophies, but all depend on some form of space-time concurrency for their potential high-speed computing capacity.
Abstract: Highly parallel computing machines can vary over a wide range of design philosophies, but all depend on some form of space-time concurrency for their potential high-speed computing capacity. Typically, such computers feature a collection of homogeneous processing elements (nodes) together with an interconnection network and can be characterized by the granularity, or power, of the node processors, the degree of parallelism as measured by the number of independent processing elements and the complexity of node coupling, which describes the degree of interaction between nodes.

Book ChapterDOI
01 Jan 1993
TL;DR: VLSI technology has made it possible for the first time to construct integrated circuits with as many as a million elements on a chip (in an area of approximately 1 cm²), enabling a high degree of parallelism.
Abstract: With the introduction of VLSI (Very Large Scale Integration) technology [16], it has been possible for the first time to construct integrated circuits with as many as a million elements on a chip (in an area of approximately 1 cm²). The high degree of parallelism that can be achieved in circuits of such density enables computations to be realized on VLSI chips with extreme speed and efficiency.

01 Jan 1993
TL;DR: In this paper, the authors describe an implementation scheme that maps sequences (lists) in the functional language FP onto a data-parallel SIMD multiprocessor; the mapping is dynamic (i.e., self-organizing at run-time via an atom vector) and is transparent to the programmer.
Abstract: This paper describes an implementation scheme that maps sequences (lists) in the functional language FP onto a data-parallel SIMD multiprocessor. The mapping is dynamic (i.e., self-organizing at run-time via an atom vector) and is transparent to the programmer. Furthermore, as the problem size and the capability of the architecture increase, the method described here will proportionally scale the degree of parallelism. We chose FP as the application language because it is a simple yet expressive language and because FP allows us to create functional forms that yield highly-parallel computations when applied to lists representing matrix or vector data. The target architecture is a MasPar MP-1 with 16K processors; functional forms can be evaluated in parallel in the absence of side effects. Various proposed functional languages and their potential contributions to parallel programming are well documented in the literature. Although the benefits of using functional languages are generally acknowledged, surprisingly few (e.g., [2], [3], [5]) have been implemented on commercial parallel systems. In this paper, we show how inherently data-parallel operations are expressed in a functional language and how these operations can be efficiently implemented on a SIMD machine. We use the functional language FP as our example language and target it for the MasPar massively data-parallel computer. The following discussion emphasizes the implementation of data-parallel operations in FP; a full description of the MasPar and a complete set of results may be found in [4] and [5], respectively.