Topic
Pipeline (computing)
About: Pipeline (computing) is a research topic. Over the lifetime, 26760 publications have been published within this topic receiving 204305 citations. The topic is also known as: data pipeline & computational pipeline.
Papers published on a yearly basis
Papers
More filters
•
07 Jun 2004
TL;DR: A distributed query engine pipeline architecture comprises cascaded analysis engines that accept an input query and each identifies a portion of the input query that it can pass on to an execution engine as discussed by the authors.
Abstract: A distributed query engine pipeline architecture comprises cascaded analysis engines that accept an input query and each identifies a portion of the input query that it can pass on to an execution engine. Each stage rewrites the input query to remove the portion identified and replaces it with a placeholder. The rewritten query is forwarded to the next analysis engine in the cascade. Each engine compiles the portion it identified so that an execution engine may process that portion. Execution preferably proceeds from the portion of the query compiled by the last analysis engine. The execution engine corresponding to the last analysis engine executes the query and makes a call to the next higher execution engine in the cascade for data from the preceding portion. The process continues until the results from the input query are fully assembled.
113 citations
••
TL;DR: In this article, the effect of corrosion defect size on the remaining pipeline strength is modeled by a Markov process Analytical solution of the probability transition matrix is obtained by solving the Kolmogorov forward differential equation.
113 citations
••
TL;DR: The PA7100 CPU, the first precision-architecture, reduced-instruction-set-computer (PA-RISC) architecture implementation to combine an integer core and floating-point coprocessor into a single-chip format, is described.
Abstract: The PA7100 CPU, the first precision-architecture, reduced-instruction-set-computer (PA-RISC) architecture implementation to combine an integer core and floating-point coprocessor into a single-chip format, is described. It incorporates superscalar execution and supports clock rates of up to 100 MHz in standard 0.8- mu m CMOS. Features such as a flexible primary cache organization and multiprocessing capability allow the device to be scaled to a variety of system applications, price ranges, and performance levels. The microprocessor instruction execution pipeline, cache design, translation look-aside buffer (TLB) for virtual address translation, floating-point unit, and system interface bus are discussed. The design, test, and verification methods used in the development of the PA7100 are reviewed. >
113 citations
••
04 Apr 2019TL;DR: This work proposes dataflow optimizations to address the shortcomings of existing parallel dataflow techniques for tiled NN accelerators, and develops buffer sharing dataflow that turns the distributed buffers into an idealized shared buffer, eliminating excessive data duplication and the memory access overheads.
Abstract: The use of increasingly larger and more complex neural networks (NNs) makes it critical to scale the capabilities and efficiency of NN accelerators. Tiled architectures provide an intuitive scaling solution that supports both coarse-grained parallelism in NNs: intra-layer parallelism, where all tiles process a single layer, and inter-layer pipelining, where multiple layers execute across tiles in a pipelined manner. This work proposes dataflow optimizations to address the shortcomings of existing parallel dataflow techniques for tiled NN accelerators. For intra-layer parallelism, we develop buffer sharing dataflow that turns the distributed buffers into an idealized shared buffer, eliminating excessive data duplication and the memory access overheads. For inter-layer pipelining, we develop alternate layer loop ordering that forwards the intermediate data in a more fine-grained and timely manner, reducing the buffer requirements and pipeline delays. We also make inter-layer pipelining applicable to NNs with complex DAG structures. These optimizations improve the performance of tiled NN accelerators by 2x and reduce their energy consumption by 45% across a wide range of NNs. The effectiveness of our optimizations also increases with the NN size and complexity.
113 citations
••
27 Aug 2013TL;DR: BigStation is presented, a scalable architecture that enables realtime signal processing in large-scale MIMO systems which may have tens or hundreds of antennas and parallelize the MU-MIMO processing with a distributed pipeline based on its computation and communication patterns.
Abstract: Multi-user multiple-input multiple-output (MU-MIMO) is the latest communication technology that promises to linearly increase the wireless capacity by deploying more antennas on access points (APs). However, the large number of MIMO antennas will generate a huge amount of digital signal samples in real time. This imposes a grand challenge on the AP design by multiplying the computation and the I/O requirements to process the digital samples. This paper presents BigStation, a scalable architecture that enables realtime signal processing in large-scale MIMO systems which may have tens or hundreds of antennas. Our strategy to scale is to extensively parallelize the MU-MIMO processing on many simple and low-cost commodity computing devices. Our design can incrementally support more antennas by proportionally adding more computing devices. To reduce the overall processing latency, which is a critical constraint for wireless communication, we parallelize the MU-MIMO processing with a distributed pipeline based on its computation and communication patterns. At each stage of the pipeline, we further use data partitioning and computation partitioning to increase the processing speed. As a proof of concept, we have built a BigStation prototype based on commodity PC servers and standard Ethernet switches. Our prototype employs 15 PC servers and can support real-time processing of 12 software radio antennas. Our results show that the BigStation architecture is able to scale to tens to hundreds of antennas. With 12 antennas, our BigStation prototype can increase wireless capacity by 6.8x with a low mean processing delay of 860μs. While this latency is not yet low enough for the 802.11 MAC, it already satisfies the real-time requirements of many existing wireless standards, e.g., LTE and WCDMA.
112 citations