Journal Article•DOI•

AltiVec extension to PowerPC accelerates media processing

01 Mar 2000-IEEE Micro (IEEE Computer Society Press)-Vol. 20, Iss: 2, pp 85-95
TL;DR: PowerPC's AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.
Abstract: There is a clear trend in personal computing toward multimedia-rich applications. These applications will incorporate a wide variety of multimedia technologies, including audio and video compression, 2D image processing, 3D graphics, speech and handwriting recognition, media mining, and narrow/broadband signal processing for communication. In response to this demand, major microprocessor vendors have announced architectural extensions to their general-purpose processors in an effort to improve their multimedia performance. Intel extended IA-32 with MMX and SSE (alias KNI), Sun enhanced Sparc with VIS, Hewlett-Packard added MAX to its PA-RISC architecture, Silicon Graphics extended the MIPS architecture with MDMX, and Digital (now Compaq) added MVI to Alpha. This article describes the most recent, and what we believe to be the most comprehensive, addition to this list: PowerPC's AltiVec. AltiVec speeds not only media processing but also nearly any application in which data parallelism exists, as demonstrated by a cycle-accurate simulation of Motorola's MPC 7400, the heart of Apple G4 systems.
Citations
01 Jan 2014
TL;DR: This draft specification may change before being accepted as standard by the RISC-V Foundation, and it remains possible that implementations made to this draft specification will not conform to the future standard.
Abstract: Volume II: Privileged Architecture, Version 1.10 (Document Version 1.10). Warning! This draft specification may change before being accepted as standard by the RISC-V Foundation. While the editors intend future changes to this specification to be forward compatible, it remains possible that implementations made to this draft specification will not conform to the future standard.

583 citations

Journal Article•DOI•
TL;DR: The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations.
Abstract: Eight synergistic processor units enable the Cell Broadband Engine's breakthrough performance. The SPU architecture implements a novel, pervasively data-parallel architecture combining scalar and SIMD processing on a wide data path. A large number of SPUs per chip provide high thread-level parallelism. The streamlined architecture provides an efficient multithreaded execution environment for both scalar and SIMD threads and represents a reaffirmation of the RISC principles of combining leading edge architecture and compiler optimizations. These design decisions have enabled the Cell BE to deliver unprecedented supercomputer-class compute power for consumer applications.

463 citations

Proceedings Article•DOI•
14 May 2007
TL;DR: It is found that the discriminatively trained CRF performs as well as or better than an HMM even when the model features do not violate the independence assumptions of the HMM, and it is confirmed that CRFs are robust against any degradation in performance.
Abstract: Activity recognition is a key component for creating intelligent, multi-agent systems. Intrinsically, activity recognition is a temporal classification problem. In this paper, we compare two models for temporal classification: hidden Markov models (HMMs), which have long been applied to the activity recognition problem, and conditional random fields (CRFs). CRFs are discriminative models for labeling sequences. They condition on the entire observation sequence, which avoids the need for independence assumptions between observations. Conditioning on the observations vastly expands the set of features that can be incorporated into the model without violating its assumptions. Using data from a simulated robot tag domain, chosen because it is multi-agent and produces complex interactions between observations, we explore the differences in performance between the discriminatively trained CRF and the generative HMM. Additionally, we examine the effect of incorporating features which violate independence assumptions between observations; such features are typically necessary for high classification accuracy. We find that the discriminatively trained CRF performs as well as or better than an HMM even when the model features do not violate the independence assumptions of the HMM. In cases where features depend on observations from many time steps, we confirm that CRFs are robust against any degradation in performance.

377 citations


Cites methods from "AltiVec extension to PowerPC accele..."

  • ...field heuristic was carefully optimized using a profiler and uses single-instruction-multiple-data (SIMD) vector instructions from either the AltiVec [22] or SSE [93] CPU extensions to compute the approximate gains....

    [...]

Book Chapter•DOI•
Atri Rudra, Pradeep Dubey, C. S. Jutla, Vijay Kumar, Josyula R. Rao, Pankaj Rohatgi
14 May 2001
TL;DR: This work explores the use of subfield arithmetic for efficient implementations of Galois Field arithmetic especially in the context of the Rijndael block cipher and describes how to select a representation which minimizes the computation cost of the relevant arithmetic.
Abstract: We explore the use of subfield arithmetic for efficient implementations of Galois Field arithmetic especially in the context of the Rijndael block cipher. Our technique involves mapping field elements to a composite field representation. We describe how to select a representation which minimizes the computation cost of the relevant arithmetic, taking into account the cost of the mapping as well. Our method results in a very compact and fast gate circuit for Rijndael encryption. In conjunction with bit-slicing techniques applied to newly proposed parallelizable modes of operation, our circuit leads to a high-performance software implementation for Rijndael encryption which offers significant speedup compared to previously reported implementations.

290 citations

Journal Article•DOI•
01 May 2006
TL;DR: This paper presents a design study for a fully programmable architecture, SODA, that supports software defined radio - a high-end signal processing application and shows that a four processor design is capable of meeting the throughput requirements of the W-CDMA and 802.11a protocols, while operating within the strict power constraints of a mobile terminal.
Abstract: The physical layer of most wireless protocols is traditionally implemented in custom hardware to satisfy the heavy computational requirements while keeping power consumption to a minimum. These implementations are time consuming to design and difficult to verify. A programmable hardware platform capable of supporting software implementations of the physical layer, or software defined radio, has a number of advantages. These include support for multiple protocols, faster time-to-market, higher chip volumes, and support for late implementation changes. The challenge is to achieve this without sacrificing power. In this paper, we present a design study for a fully programmable architecture, SODA, that supports software defined radio, a high-end signal processing application. Our design achieves high performance, energy efficiency, and programmability through a combination of features that include single-instruction multiple-data (SIMD) parallelism, and hardware optimized for 16-bit computations. The basic processing element is an asymmetric processor consisting of a scalar and SIMD pipeline, and a set of distributed scratchpad memories that are fully managed in software. Results show that a four processor design is capable of meeting the throughput requirements of the W-CDMA and 802.11a protocols, while operating within the strict power constraints of a mobile terminal.

257 citations

References
Journal Article•DOI•
Alexander D. Peleg, Uri Weiser
TL;DR: MMX technology extends the Intel architecture to improve the performance of multimedia, communications, and other numeric-intensive applications by introducing data types and instructions to the IA that exploit the parallelism in these applications.
Abstract: Designed to accelerate multimedia and communications software, MMX technology improves performance by introducing data types and instructions to the IA that exploit the parallelism in these applications. MMX technology extends the Intel architecture (IA) to improve the performance of multimedia, communications, and other numeric-intensive applications. It uses a SIMD (single-instruction, multiple-data) technique to exploit the parallelism inherent in many algorithms, producing full application performance of 1.5 to 2 times faster than the same applications run on the same processor without MMX. The extension also maintains full compatibility with existing IA microprocessors, operating systems, and applications while providing new instructions and data types that applications can use to achieve a higher level of performance on the host CPU.

552 citations

Journal Article•DOI•
Ruby B. Lee
TL;DR: It is proposed that subword parallelism (parallel computation on lower precision data packed into a word) is an efficient and effective solution for accelerating media processing.
Abstract: MAX-2 illustrates how a small set of instruction extensions can provide subword parallelism to accelerate media processing and other data-parallel programs. This article proposes that subword parallelism (parallel computation on lower precision data packed into a word) is an efficient and effective solution for accelerating media processing. As an example, it describes MAX-2, a very lean, RISC-like set of media acceleration primitives included in the 64-bit PA-RISC 2.0 architecture. Because MAX-2 strives to be a minimal set of instructions, the article discusses both instructions included and excluded. Several examples illustrate the use of MAX-2 instructions, which provide subword parallelism in a word-oriented general-purpose processor at essentially no incremental cost.

382 citations
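
The idea of subword parallelism in the MAX-2 entry above can be illustrated with a small software sketch. This is not MAX-2, AltiVec, or any real instruction set: it emulates a packed add on four 8-bit lanes held in one 32-bit integer, and every name in it (`pack`, `unpack`, `packed_add`, the mask constants) is hypothetical and chosen only for illustration.

```python
# Illustrative only: emulates a 4 x 8-bit packed add inside a 32-bit word.
# All names here are assumptions for this sketch, not part of any real ISA.

LANE_BITS = 8
LANES = 4
WORD_MASK = 0xFFFFFFFF
HIGH_BITS = 0x80808080  # top bit of every 8-bit lane
LOW_BITS = 0x7F7F7F7F   # low seven bits of every 8-bit lane


def pack(lanes):
    """Pack four values (each taken modulo 256) into one 32-bit word, lane 0 lowest."""
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & 0xFF) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 32-bit word back into its four 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & 0xFF for i in range(LANES)]


def packed_add(a, b):
    """Add all four lanes modulo 256 with a single word-wide addition.

    Adding only the low seven bits of each lane cannot carry into the
    neighbouring lane; the top bit of each lane is then fixed up with XOR.
    """
    low_sum = (a & LOW_BITS) + (b & LOW_BITS)
    return (low_sum ^ ((a ^ b) & HIGH_BITS)) & WORD_MASK


if __name__ == "__main__":
    x = pack([250, 1, 128, 255])
    y = pack([10, 2, 128, 1])
    print(unpack(packed_add(x, y)))
```

One word-wide addition here does the work of four narrow ones, which is the effect a hardware extension provides directly, without the masking overhead this software emulation needs.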

Journal Article•DOI•
TL;DR: UltraSparc's Visual Instruction Set, described here in detail, accelerates some widely used media-processing algorithms by as much as seven times.
Abstract: UltraSparc's Visual Instruction Set, described here in detail, accelerates some widely used media-processing algorithms by as much as seven times. Today's new media, increasingly sophisticated 3D graphics environments, videoconferencing, MPEG video playback, 3D visualization, image processing, and so on, demand enhancements to conventional RISC instruction sets, which were not originally designed to handle such applications. The Visual Instruction Set (VIS) is a comprehensive set of RISC-style instructions targeted at accelerating this new media processing.

266 citations

Journal Article•DOI•
K. Diefendorff, Pradeep Dubey
TL;DR: The authors predict high-performance, general-purpose processors will incorporate more media processing capabilities, eventually bringing about the demise of specialized media processors, except perhaps, in embedded applications.
Abstract: Workloads drive architecture design and will change in the next two decades. For high-performance, general-purpose processors, there is a consensus that multimedia will continue to grow in importance. The authors predict these processors will incorporate more media processing capabilities, eventually bringing about the demise of specialized media processors, except perhaps, in embedded applications. These enhanced general-purpose processor capabilities will arise from multimedia applications that require real-time response, continuous-media data types and significant fine-grained data parallelism.

231 citations