
Showing papers on "Bitonic sorter published in 2017"


Proceedings ArticleDOI
03 Jul 2017
TL;DR: This work provides an extensive model of all memory configuration options for Xeon Phi KNL and demonstrates how it can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively.
Abstract: Increasingly complex memory systems and on-chip interconnects are developed to mitigate the data movement bottlenecks in manycore processors. One example of such a complex system is the Xeon Phi KNL CPU with three different types of memory, fifteen memory configuration options, and a complex on-chip mesh network connecting up to 72 cores. Users require a detailed understanding of the performance characteristics of the different options to utilize the system efficiently. Unfortunately, peak performance is rarely achievable and achievable performance is hardly documented. We address this with capability models of the memory subsystem, derived by systematic measurements, to guide users in navigating the complex optimization space. As a case study, we provide an extensive model of all memory configuration options for Xeon Phi KNL. We demonstrate how our capability model can be used to automatically derive new close-to-optimal algorithms for various communication functions, yielding improvements of 5x and 24x over Intel’s tuned OpenMP and MPI implementations, respectively. Furthermore, we demonstrate how to use the models to assess how efficiently a bitonic sort application utilizes the memory resources. Interestingly, our capability models predict and explain that the high-bandwidth MCDRAM does not improve the bitonic sort performance over DRAM.
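
As an illustration of the kind of kernel the case study analyzes, below is a minimal, generic bitonic sort sketch in C++ with OpenMP (not the paper's code); its strided compare-exchange passes are the memory access pattern whose DRAM vs. MCDRAM bandwidth demands a capability model would have to capture.

```cpp
// Hedged sketch (not the paper's code): in-place bitonic sort of a
// power-of-two-length array, with each compare-exchange pass parallelized.
#include <cstddef>
#include <utility>
#include <vector>

void bitonic_sort(std::vector<float>& a) {
    const std::size_t n = a.size();                    // assumed power of two
    for (std::size_t k = 2; k <= n; k <<= 1) {         // bitonic sequence size
        for (std::size_t j = k >> 1; j > 0; j >>= 1) { // compare-exchange stride
            #pragma omp parallel for
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t partner = i ^ j;
                if (partner > i) {                     // handle each pair once
                    bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
        }
    }
}
```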

38 citations


Proceedings ArticleDOI
01 Oct 2017
TL;DR: This work presents two very different FPGA implementations of a database join operation, one using a direct O(n²) algorithm, and the other using a bitonic sort network to speed up the join operation.
Abstract: The growing trend toward heterogeneous platforms is crucial to meet time and power consumption constraints for high-performance computing applications. The OpenCL parallel programming language and framework enable programming CPUs, GPUs and, recently, FPGAs using the same source code. This makes it easier for software developers to implement applications on the various devices supported by heterogeneous HPC platforms. This work presents two very different FPGA implementations of a database join operation, one using a direct O(n²) algorithm, and the other using a bitonic sort network to speed up the join operation. A comparison of performance and energy consumption for both FPGAs and GPUs is provided, which suggests a 40% performance-per-watt improvement from using an FPGA instead of a GPU.
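
The idea of using a sort to avoid the direct O(n²) join can be sketched in plain C++ as below; this is an assumed software analogue, not the paper's OpenCL kernels, and std::sort stands in for the bitonic sorting network used on the FPGA.

```cpp
// Hedged sketch: an equi-join where one relation is sorted first and then
// probed by binary search, instead of comparing every pair of tuples.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

std::vector<std::pair<std::uint32_t, std::uint32_t>>
sorted_join(std::vector<std::uint32_t> build,
            const std::vector<std::uint32_t>& probe) {
    std::sort(build.begin(), build.end());   // bitonic network on the FPGA
    std::vector<std::pair<std::uint32_t, std::uint32_t>> out;
    for (std::uint32_t key : probe) {
        // Emit one result pair per matching build-side tuple.
        auto range = std::equal_range(build.begin(), build.end(), key);
        for (auto it = range.first; it != range.second; ++it)
            out.emplace_back(key, *it);
    }
    return out;
}
```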

12 citations



Journal ArticleDOI
TL;DR: A new sorting network on 24 channels is presented, which uses only 12 layers, improving the previously best known bound by one layer; it also implies improved sorting networks on 23 channels.
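
For context, a sorting network is a fixed arrangement of compare-exchange elements grouped into layers of disjoint comparators, and the layer count is its depth; by the zero-one principle, a candidate network can be verified by testing all 0/1 inputs. The sketch below does not reproduce the 24-channel, 12-layer network, but shows such a checker with a classic 4-channel, 3-layer network as the example.

```cpp
// Hedged sketch: verify a layered comparator network via the zero-one principle.
#include <cstdint>
#include <utility>
#include <vector>

using Layer = std::vector<std::pair<int, int>>;   // disjoint comparators

bool sorts_all_01_inputs(int channels, const std::vector<Layer>& layers) {
    for (std::uint64_t mask = 0; mask < (1ull << channels); ++mask) {
        std::vector<int> v(channels);
        for (int i = 0; i < channels; ++i) v[i] = (mask >> i) & 1;
        for (const Layer& layer : layers)
            for (auto [lo, hi] : layer)           // compare-exchange
                if (v[lo] > v[hi]) std::swap(v[lo], v[hi]);
        for (int i = 1; i < channels; ++i)
            if (v[i - 1] > v[i]) return false;    // some 0/1 input not sorted
    }
    return true;
}

int main() {
    // Classic depth-3 network on 4 channels, given as layers of comparators.
    std::vector<Layer> net = {{{0, 1}, {2, 3}}, {{0, 2}, {1, 3}}, {{1, 2}}};
    return sorts_all_01_inputs(4, net) ? 0 : 1;
}
```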

3 citations


Journal ArticleDOI
01 Jan 2017
TL;DR: A sorting method that applies parallelization and asynchronization is considered, based on dividing the array into a set of independent adjacent pairs of numbers and comparing those pairs in parallel and asynchronously.
Abstract: Speeding up and/or optimizing computations remains a relevant task. Among the approaches to these tasks, this paper considers a method that applies parallelization and asynchronization to a sorting algorithm. Sorting methods are among the elementary methods and are used in a huge number of different applications. In this paper, we offer a method for sorting an array that is based on dividing it into a set of independent adjacent pairs of numbers and comparing those pairs in parallel and asynchronously, which distinguishes the offered method from the traditional sorting algorithms (such as quicksort, merge sort, insertion sort and others). The algorithm is implemented with the use of Petri nets, as the most suitable tool for describing asynchronous systems.
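
A minimal synchronous sketch of the adjacent-pair idea is given below; it is an approximation for illustration only, since the paper's contribution is the asynchronous, Petri-net-modelled version, which is not reproduced here.

```cpp
// Hedged sketch: the array is split into disjoint adjacent pairs, each pair is
// compared independently, and the pairing alternates between even and odd
// offsets until the array is sorted (essentially odd-even transposition sort).
#include <cstddef>
#include <utility>
#include <vector>

void adjacent_pair_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    if (n < 2) return;
    for (std::size_t phase = 0; phase < n; ++phase) {
        const std::size_t start = phase % 2;          // even pairs, then odd pairs
        #pragma omp parallel for                      // pairs are disjoint and independent
        for (std::size_t i = start; i < n - 1; i += 2)
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
    }
}
```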

3 citations


Proceedings ArticleDOI
01 May 2017
TL;DR: Two new sorting schemes, a quick select (QS) based selection algorithm and a simplified bitonic sorter (SBT), are proposed; they exploit the special data dependency of path metrics in log-likelihood ratio based SCL decoding.
Abstract: The path metric sorting unit of successive cancellation list (SCL) decoders for polar codes is the main concern in this paper. After reviewing existing sorting units in SCL decoders, we propose two new sorting schemes, namely a quick select (QS) based selection algorithm and a simplified bitonic sorter (SBT), which exploit the special data dependency of path metrics in log-likelihood ratio based SCL decoding. Theoretical analysis shows that for list sizes of L ≤ 8, the QS-based selection algorithm has a lower delay than existing schemes. An FPGA implementation on the Artix-7 family shows that for list sizes of L ≥ 16, SBT has the same delay while reducing hardware by over 40%.
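
To illustrate the QS idea in software terms (this is an assumed analogue, not the paper's hardware design): keeping the L best of 2L candidate paths only requires partitioning the path metrics around the L-th smallest value rather than fully sorting them, which std::nth_element does directly.

```cpp
// Hedged sketch: select the L candidates with the smallest path metrics by
// partial partitioning instead of a full sort. The hardware-specific use of
// the path-metric data dependency described in the paper is not reproduced.
#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<std::size_t> select_survivors(const std::vector<double>& metrics,
                                          std::size_t L) {     // assumes L <= metrics.size()
    std::vector<std::size_t> idx(metrics.size());
    for (std::size_t i = 0; i < idx.size(); ++i) idx[i] = i;
    std::nth_element(idx.begin(), idx.begin() + L, idx.end(),
                     [&](std::size_t a, std::size_t b) {
                         return metrics[a] < metrics[b];
                     });
    idx.resize(L);                    // surviving paths, in no particular order
    return idx;
}
```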

3 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: A CPU-FPGA hybrid list design is proposed to accelerate financial market servers to microsecond-level latencies, significantly reducing the latency from 100+ microseconds to 2 microseconds and gaining a speedup of 50x.
Abstract: The financial market server in exchanges aims to maintain the order books and provide real-time market data feeds to traders. Low-latency processing is in great demand in financial trading. Although software solutions provide the flexibility to express algorithms in high-level programming models and to recompile quickly, they are becoming increasingly uncompetitive due to their long and unpredictable response times. Field Programmable Gate Arrays (FPGAs) have proved to be an established technology for achieving low and constant latency when processing streaming packets in a hardware-accelerated way. However, maintaining order books on FPGAs involves organizing packets into GBs of structured data as well as complicated routines (sort, insertion, deletion, etc.), which is extremely challenging for FPGA designs in both design methodology and memory volume. Thus, existing FPGA designs often leave the post-processing part to the CPU, which largely cancels the latency gain of the network packet processing part. This paper proposes a CPU-FPGA hybrid list design to accelerate financial market servers that achieves microsecond-level latencies. The paper makes four main contributions. First, we design a CPU-FPGA hybrid list with two levels: a small cache list on the FPGA and a large master list at the CPU host. Both lists are sorted with different schemes, where bitonic sort is applied to the cache list while a balanced tree maintains the master list. Second, in order to update the hybrid sorted list effectively, we derive a complete set of low-latency routines, including insertion, deletion, selection, sorting, etc., providing a latency on the scale of a few cycles. Third, we propose a non-blocking, on-demand synchronization strategy for the cache list and the master list to communicate with each other. Lastly, we integrate the hybrid list with other components, such as packet splitting, parsing, and processing, to form an industry-level financial market server. Our design is deployed in the environment of the China Financial Futures Exchange (CFFEX), demonstrating its functionality and stability by running for 600+ hours with hundreds of millions of packets per day. Compared with the existing CPU-based solution in CFFEX, our system supports identical functionalities while significantly reducing the latency from 100+ microseconds to 2 microseconds, a speedup of 50x.
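
A heavily simplified software sketch of the two-level structure is shown below; it is an assumed illustration of the cache-list/master-list split only, not the paper's RTL or host code, with a small sorted vector standing in for the FPGA cache list (which a bitonic sorter would maintain) and std::map for the balanced-tree master list.

```cpp
// Hedged sketch of the two-level idea: a small sorted cache in front of a
// large balanced-tree master list; inserts land in the cache and the worst
// entry spills to the master list when the cache overflows.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct OrderBookSide {
    static constexpr std::size_t kCacheSize = 16;                 // assumed cache capacity
    std::vector<std::pair<std::uint64_t, std::uint64_t>> cache;   // (price, qty), kept sorted
    std::map<std::uint64_t, std::uint64_t> master;                // balanced tree on the CPU

    void insert(std::uint64_t price, std::uint64_t qty) {
        auto pos = std::lower_bound(cache.begin(), cache.end(),
                                    std::make_pair(price, std::uint64_t{0}));
        cache.insert(pos, {price, qty});                          // keep cache sorted by price
        if (cache.size() > kCacheSize) {                          // spill on overflow
            auto worst = cache.back();                            // e.g. worst price level
            cache.pop_back();
            master[worst.first] += worst.second;                  // sync into master list
        }
    }
};
```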

1 citation


Patent
19 Jun 2017
TL;DR: In this paper, a bitonic network including first switches is configured to receive a first randomly ordered list and random switch settings, determine a permutation of the first randomly ordered list using the first switches, where the permutation is a second randomly ordered list, and output the second randomly ordered list.
Abstract: Systems and methods for determining a cumulative control state for mapping logical block addresses (LBAs) to physical block addresses (PBAs) are disclosed. One such system includes a bitonic network including first switches and configured to receive a first randomly ordered list and random switch settings, determine a permutation of the first randomly ordered list using the random switch settings at the first switches, where the permutation includes a second randomly ordered list, and output the second randomly ordered list; a bitonic sorter including second switches and configured to receive the second randomly ordered list, sort the second randomly ordered list, and output settings of the second switches used to achieve the sort, where the second switch settings define a cumulative control state; and an access network configured to determine a PBA of a non-volatile memory (NVM) to enable a data access of a corresponding LBA using the cumulative control state.
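
A rough software analogue of this mechanism (not the patented circuit) is sketched below: the same bitonic network structure is traversed twice, once with caller-supplied random bits driving the switches to permute the list, and once as a sorter whose per-switch swap decisions are recorded as the control state.

```cpp
// Hedged software analogue: walk the switch positions of a bitonic network over
// a power-of-two-length array. In permuter mode a bit per switch (supplied in
// 'bits') decides whether to swap; in sorter mode the swap decision comes from
// the data and is appended to 'bits', forming the recorded control state.
#include <cstddef>
#include <utility>
#include <vector>

void bitonic_network(std::vector<int>& a, std::vector<bool>& bits, bool sorter_mode) {
    const std::size_t n = a.size();
    std::size_t s = 0;                                   // switch index, network order
    for (std::size_t k = 2; k <= n; k <<= 1)
        for (std::size_t j = k >> 1; j > 0; j >>= 1)
            for (std::size_t i = 0; i < n; ++i) {
                std::size_t p = i ^ j;
                if (p <= i) continue;                    // visit each switch once
                bool ascending = (i & k) == 0;
                bool swap_now;
                if (sorter_mode) {                       // decide from the data
                    swap_now = (a[i] > a[p]) == ascending;
                    bits.push_back(swap_now);            // record control state
                } else {                                 // decide from random bits
                    swap_now = bits[s];
                }
                if (swap_now) std::swap(a[i], a[p]);
                ++s;
            }
}
```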

1 citation


Proceedings ArticleDOI
01 May 2017
TL;DR: A recursive, odd-even transposition sorter based vector quantizer for use in mismatch shaping algorithms is presented, together with an area-efficient variant. Speed and area results indicate that the proposed algorithm sorts 32 inputs at a 42% faster rate using 14% fewer components than the perfect shuffle sorter, and at an 80% slower rate using 27% fewer components than the bitonic sorter.
Abstract: A recursive, odd-even transposition sorter based vector quantizer, which is used in mismatch shaping algorithms, is presented. Although recursive parallel sorting algorithms require less area than fully parallel sorting algorithms, they are slower than fully parallel algorithms. A widely used recursive parallel sorting algorithm is the perfect shuffle, which requires multiple clock cycles to shuffle and sort the data. The proposed recursive algorithm uses fewer clock cycles than the perfect shuffle to sort fewer than 80 inputs. An area-efficient version is also proposed that sorts fewer than 16 inputs faster than the perfect shuffle algorithm. To compare the performance of the various sorting algorithms suitable for a vector quantizer, they are realized and synthesized in TSMC 40nm low-power technology. Speed and area results indicate that the proposed algorithm sorts 32 inputs at a 42% faster rate by using 14% fewer components than the perfect shuffle sorter, and at an 80% slower rate by using 27% fewer components than the bitonic sorter. The area-efficient version sorts 32 inputs at a 21% slower rate by using 32% fewer components than the perfect shuffle sorter.
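
For orientation, the textbook comparator counts behind such area/speed trade-offs can be computed directly (these formulas are generic and are not the paper's synthesis results): an odd-even transposition network on n inputs needs n stages and n(n-1)/2 comparators, while a bitonic network on n = 2^m inputs needs m(m+1)/2 stages of n/2 comparators each.

```cpp
// Hedged back-of-the-envelope comparison: comparator count and stage depth of
// an odd-even transposition network versus a fully parallel bitonic network.
#include <cstdio>

int main() {
    const int n = 32;                                    // inputs, power of two
    int log2n = 0;
    for (int t = n; t > 1; t >>= 1) ++log2n;             // log2(32) = 5

    // Odd-even transposition: n stages of adjacent comparators.
    int oet_depth = n;
    int oet_comparators = n * (n - 1) / 2;

    // Bitonic network: log2(n)*(log2(n)+1)/2 stages of n/2 comparators each.
    int bitonic_depth = log2n * (log2n + 1) / 2;
    int bitonic_comparators = n / 2 * bitonic_depth;

    std::printf("odd-even transposition: depth %d, %d comparators\n",
                oet_depth, oet_comparators);
    std::printf("bitonic network:        depth %d, %d comparators\n",
                bitonic_depth, bitonic_comparators);
    return 0;
}
```

For n = 32 this gives 32 stages and 496 comparators for odd-even transposition versus 15 stages and 240 comparators for the bitonic network, illustrating why the recursive, multi-cycle designs in the paper trade speed for area.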