The OpenCL standard offers a common API for program execution on systems composed of different types of computational devices such as multicore CPUs, GPUs, or other accelerators.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems

The CIPRES Science Gateway (CSG) provides researchers and educators with browser-based access to community codes for inference of phylogenetic relationships from DNA and protein sequence data. The CSG allows users to deploy jobs on the high-performance computers of the TeraGrid without requiring detailed knowledge of their complexities. Use of the CSG has grown rapidly; through March 2011 it had more than 2,200 users and enabled more than 180 peer-reviewed publications. The rapid growth in resource consumption was accommodated by deploying codes on Trestles, a new TeraGrid computer. Tools and policies were developed to ensure efficient and effective resource use. This paper describes progress in managing the growth of this public cyberinfrastructure resource and reviews the domain science that it has enabled.

The CIPRES science gateway: a community resource for phylogenetic analyses

The nematode Caenorhabditis elegans is a major laboratory model in biology. Only ten Caenorhabditis species were available in culture at the onset of this study. Many of them, like C. elegans, were mostly isolated from artificial compost heaps, and their more natural habitat was unknown. Caenorhabditis nematodes were found to be proliferating in rotten fruits, flowers and stems. By collecting a large worldwide set of such samples, 16 new Caenorhabditis species were discovered. We performed mating tests to establish biological species status and found some instances of semi-fertile or sterile hybrid progeny. We established barcodes for all species using ITS2 rDNA sequences. By obtaining sequence data for two rRNA and nine protein-coding genes, we determined the likely phylogenetic relationships among the 26 species in culture. The new species are part of two well-resolved sister clades that we call the Elegans super-group and the Drosophilae super-group. We further scored phenotypic characters such as reproductive mode, mating behavior and male tail morphology, and discuss their congruence with the phylogeny. A small space between rays 2 and 3 evolved once in the stem species of the Elegans super-group; a narrow fan and spiral copulation evolved once in the stem species of C. angaria, C. sp. 8 and C. sp. 12. Several other character changes occurred convergently. For example, hermaphroditism evolved three times independently in C. elegans, C. briggsae and C. sp. 11. Several species can co-occur in the same location or even the same fruit. At the global level, some species have a cosmopolitan distribution: C. briggsae is particularly widespread, while C. elegans and C. remanei are found mostly or exclusively in temperate regions, and C. brenneri and C. sp. 11 exclusively in tropical zones. Other species have limited distributions, for example C. sp. 5 appears to be restricted to China, C. sp. 7 to West Africa and C. sp. 8 to the Eastern United States. 
Caenorhabditis are "fruit worms", not soil nematodes. The 16 new species provide a resource and their phylogeny offers a framework for further studies into the evolution of genomic and phenotypic characters.

A phylogeny and molecular barcodes for Caenorhabditis, with numerous new species from rotting fruits

Phylogeny of Eutardigrada: new molecular data and their morphological support lead to the identification of new evolutionary lineages.

An electronic device and a control method thereof includes an infrared module which receives an infrared control signal having a predetermined section-specific waveform from an infrared transmitter, a switching unit which turns on/off the infrared module, and a controller which controls the switching unit to alternately turn on and off the infrared module in a predetermined cycle for recognizing the predetermined section-specific waveform of the infrared control signal. Thus, electric power consumed by an infrared module in a standby mode may be reduced while receiving a signal of an infrared transmitter.

Electronic device and control method thereof

The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization. The proposed model was experimentally verified for different architectures by taking advantage of built-in hardware counters with a curve fitness above 90%.

Cache-aware Roofline model: Upgrading the loft
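The bound that the Roofline model plots can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the classic bound is the smaller of peak compute throughput and memory bandwidth times arithmetic intensity, and the cache-aware variant is assumed here to draw one such roof per memory level. The function names, the level names, and all numbers are hypothetical.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Classic Roofline bound.

    intensity is arithmetic intensity in FLOPs per byte moved.
    The bound is min(peak compute, bandwidth * intensity).
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def cache_aware_roofs(peak_gflops, bandwidths_gbs, intensity):
    """One roof per memory level (assumption: each level has its own bandwidth).

    bandwidths_gbs maps a level name (e.g. "L1", "DRAM") to GB/s.
    """
    return {
        level: attainable_gflops(peak_gflops, bw, intensity)
        for level, bw in bandwidths_gbs.items()
    }


if __name__ == "__main__":
    # Hypothetical machine: 100 GFLOP/s peak, made-up bandwidths per level.
    levels = {"L1": 400.0, "L2": 150.0, "DRAM": 20.0}
    for ai in (0.25, 1.0, 8.0):
        print(ai, cache_aware_roofs(100.0, levels, ai))
```

A kernel whose working set fits in a faster level sits under a higher roof at the same arithmetic intensity, which is the kind of guidance a single DRAM-only roof cannot give.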

A processor includes a processor core and a calculation circuit. The processor core includes logic to determine a set of weights for use in a convolutional neural network (CNN) calculation and to scale up the weights using a scale value. The calculation circuit includes logic to receive the scale value, the set of weights, and a set of input values, wherein each input value and its associated weight are of a same fixed size. The calculation circuit also includes logic to determine results from CNN calculations based upon the set of weights applied to the set of input values, scale down the results using the scale value, truncate the scaled-down results to the fixed size, and communicatively couple the truncated results to an output for a layer of the CNN.

Weight-shifting mechanism for convolutional neural networks
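The scale-up, compute, scale-down, truncate sequence in the abstract above can be sketched in software. This is only an illustration under stated assumptions: the scale value is taken to be a power of two (so scaling is a bit shift), the fixed size is 8 signed bits, and "truncate" is modeled as clamping to that range. None of these choices, nor the function name, come from the patent.

```python
def scaled_fixed_point_dot(weights, inputs, shift=4, bits=8):
    """Dot product with weights scaled up by 2**shift before integer math.

    weights: real-valued weights (floats)
    inputs:  integer input values
    shift:   hypothetical scale value, expressed as a power of two
    bits:    hypothetical fixed size of the output, signed
    """
    # Scale up: small fractional weights keep precision as integers.
    scaled = [round(w * (1 << shift)) for w in weights]
    # Integer multiply-accumulate, as a calculation circuit would do.
    acc = sum(sw * x for sw, x in zip(scaled, inputs))
    # Scale down the result by the same scale value.
    result = acc >> shift
    # Fit the result to the fixed size (assumption: saturating clamp).
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, result))


if __name__ == "__main__":
    print(scaled_fixed_point_dot([0.5, 0.25], [10, 8]))
```

Without the scale-up step, a weight such as 0.3 would round to 0 in integer arithmetic and its input would be lost entirely; scaling first preserves a usable approximation.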

We are currently faced with the situation where applications have increasing computational demands and there is a wide selection of parallel processor systems. In this paper we focus on exploiting fine-grain parallelism for a demanding Bioinformatics application - MrBayes - and its Phylogenetic Likelihood Functions (PLF) using different architectures. Our experiments compare side-by-side the scalability and performance achieved using general-purpose multi-core processors, the Cell/BE, and Graphics Processor Units (GPU). The results indicate that all processors scale well for larger computation and data sets. Also, GPU and Cell/BE processors achieve the best improvement for the parallel code section. Nevertheless, data transfers and the execution of the serial portion of the code are the reasons for their poor overall performance. The general-purpose multi-core processors prove to be simpler to program and provide the best balance between an efficient parallel and serial execution, resulting in the largest speedup.

Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function

A storage device and method are described for performing convolution operations. For example, one embodiment of an apparatus to perform convolution operations comprises a plurality of processing units to execute convolution operations on input data and partial results; a unified scratchpad memory comprising a plurality of memory banks communicatively coupled to the plurality of processing units through a plurality of read/write ports, each of the plurality of memory banks partitioned to store both the input data and partial results; a control unit to allocate the input data and partial results to the memory banks to ensure a minimum quality of service in accordance with the specified number of read/write ports and the specified convolution operation to be performed.

Storage device and method for performing convolution operations

An apparatus and method are described for distributed and cooperative computation in artificial neural networks. For example, one embodiment of an apparatus comprises: an input/output (I/O) interface; a plurality of processing units communicatively coupled to the I/O interface to receive data for input neurons and synaptic weights associated with each of the input neurons, each of the plurality of processing units to process at least a portion of the data for the input neurons and synaptic weights to generate partial results; and an interconnect communicatively coupling the plurality of processing units, each of the processing units to share the partial results with one or more other processing units over the interconnect, the other processing units using the partial results to generate additional partial results or final results. The processing units may share data including input neurons and weights over the shared input bus.
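The partial-result sharing described above can be illustrated with a toy model. This sketch assumes each processing unit receives a contiguous slice of the input neurons and their synaptic weights, computes a partial weighted sum, and that the partial sums are then combined into the final result. The slicing scheme and the function names are hypothetical and stand in for the hardware interconnect only conceptually.

```python
def partial_sums(inputs, weights, num_units):
    """Split the work across num_units and return one partial sum per unit."""
    # Ceiling division so every input is assigned to some unit.
    chunk = max(1, -(-len(inputs) // num_units))
    return [
        sum(x * w for x, w in zip(inputs[i:i + chunk], weights[i:i + chunk]))
        for i in range(0, len(inputs), chunk)
    ]


def neuron_output(inputs, weights, num_units):
    """Combine the shared partial results into the final weighted sum."""
    return sum(partial_sums(inputs, weights, num_units))


if __name__ == "__main__":
    print(partial_sums([1, 2, 3, 4], [1, 1, 1, 1], 2))
    print(neuron_output([1, 2, 3, 4], [1, 1, 1, 1], 2))
```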

Frederico Pratas

Papers

Cache-aware Roofline model: Upgrading the loft

Weight-shifting mechanism for convolutional neural networks

Fine-grain Parallelism Using Multi-core, Cell/BE, and GPU Systems: Accelerating the Phylogenetic Likelihood Function

Storage device and method for performing convolution operations

Method and apparatus for distributed and cooperative computation in artificial neural networks