Some embodiments provide an integrated circuit (IC) for implementing a machine-trained network with multiple layers. The IC includes a set of circuits to compute a dot product of (i) a first number of input values computed by other circuits of the IC and (ii) a set of predefined weight values, several of which are zero, with a weight value for each of the input values. The set of circuits includes (i) a dot product computation circuit to compute the dot product based on a second number of inputs and (ii) for each input value, at least two sets of wires for providing the input value to at least two of the dot product computation circuit inputs. The second number is less than the first number. Each input value with a corresponding weight value that is not equal to zero is provided to a different one of the dot product computation circuit inputs.

Reduced dot product computation circuit
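
The routing idea in this abstract can be illustrated in software. The following is a minimal sketch, not the circuit described in the patent: it assumes a hypothetical wiring in which input i reaches circuit inputs i mod M and (i + 1) mod M, and it uses a simple augmenting-path search to give every input with a nonzero weight its own circuit input. The names `assign_slots` and `reduced_dot` are illustrative only.

```python
def assign_slots(weights, wiring):
    """Give every nonzero-weight input its own circuit input (slot).

    wiring[i] lists the slots that input i is wired to (an assumption
    of this sketch). Returns {slot: input_index}, or None if no
    conflict-free assignment exists.
    """
    slot_owner = {}

    def try_place(i, seen):
        # Augmenting-path search: take a free slot, or evict the
        # current owner if that owner can move to another slot.
        for slot in wiring[i]:
            if slot in seen:
                continue
            seen.add(slot)
            if slot not in slot_owner or try_place(slot_owner[slot], seen):
                slot_owner[slot] = i
                return True
        return False

    for i, w in enumerate(weights):
        if w != 0 and not try_place(i, set()):
            return None
    return slot_owner


def reduced_dot(inputs, weights, num_slots):
    """Dot product over num_slots circuit inputs, where num_slots < len(inputs)."""
    # Hypothetical wiring: two candidate slots per input value.
    wiring = [(i % num_slots, (i + 1) % num_slots) for i in range(len(inputs))]
    slots = assign_slots(weights, wiring)
    if slots is None:
        raise ValueError("nonzero weights cannot be routed to distinct inputs")
    # Only routed inputs contribute; zero-weight inputs are never multiplied.
    return sum(inputs[i] * weights[i] for i in slots.values())


inputs = [1, 2, 3, 4, 5, 6, 7, 8]
weights = [0, 3, 0, 0, -1, 0, 2, 0]   # mostly zero, as in the abstract
print(reduced_dot(inputs, weights, num_slots=4))  # prints 15
```

Giving each input two candidate slots is what lets the search resolve conflicts: when two nonzero-weight inputs share a preferred slot, one of them can be moved to its alternate.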

A processor is provided. The processor includes a plurality of processing elements arranged in a matrix form and a controller configured to control the plurality of processing elements during a plurality of cycles to process target data. The controller controls first processing elements so that each of the first processing elements operates on data provided from adjacent first processing elements and on the input first element; inputs each of the second elements included in a second row among the plurality of elements to second processing elements arranged in the second row among the plurality of processing elements; and controls the second processing elements so that each of the second processing elements operates on data provided from adjacent second processing elements and on the input second element, and operates on data provided from the adjacent first processing elements in the same column among the first processing elements and on pre-stored operation data.

Processor and control methods thereof

The invention provides an edge computing task allocation method based on a deep neural network, comprising the following steps: obtaining parameter quantity data by computing the parameter quantity of each network layer to be computed in the neural network; obtaining computation amount data according to the parameter quantities of the network layers to be computed; allocating computing tasks by obtaining the computing tasks of the terminal equipment according to the computation amount data; obtaining the computing tasks of the edge server according to the computing tasks of the terminal equipment; and judging whether the remaining computing tasks need to be executed in the cloud server. The invention further provides an edge computing task allocation device based on the deep neural network, and a storage medium. The real-time residual computing resources of each layer of equipment can be fully considered, the parameter quantity and the computation amount of each network layer are calculated on that basis, a corresponding deployment scheme is obtained, and the computing capacity of each layer of equipment is fully utilized.

Edge computing task allocation method and device based on deep neural network
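
The allocation steps in this abstract can be sketched as a greedy split of consecutive network layers across terminal, edge server, and cloud. This is a minimal illustration under stated assumptions, not the claimed method: the cost model in `estimate_flops` (two operations per parameter per output position) and the per-tier budgets are hypothetical, and the abstract does not say how the computation amount is derived from the parameter count.

```python
def estimate_flops(params, output_positions=1):
    """Hypothetical cost model: one multiply and one add per parameter,
    repeated for every output position (1 for a dense layer)."""
    return 2 * params * output_positions


def allocate_layers(layer_flops, terminal_budget, edge_budget):
    """Assign consecutive layers to the terminal, then the edge server;
    whatever does not fit is left for the cloud server."""
    plan = {"terminal": [], "edge": [], "cloud": []}
    tiers = [("terminal", terminal_budget), ("edge", edge_budget)]
    tier = 0
    remaining = tiers[0][1]
    for index, flops in enumerate(layer_flops):
        # Move to the next tier once the current one cannot fit this layer.
        while tier < len(tiers) and flops > remaining:
            tier += 1
            remaining = tiers[tier][1] if tier < len(tiers) else 0
        if tier < len(tiers):
            plan[tiers[tier][0]].append(index)
            remaining -= flops
        else:
            plan["cloud"].append(index)
    return plan


layer_params = [5, 10, 15, 20, 25]                    # parameter count per layer
layer_flops = [estimate_flops(p) for p in layer_params]
plan = allocate_layers(layer_flops, terminal_budget=35, edge_budget=75)
print(plan)  # {'terminal': [0, 1], 'edge': [2, 3], 'cloud': [4]}
```

In this sketch the budgets stand in for the real-time residual computing resources of each tier; a non-empty `plan["cloud"]` corresponds to the final judgment step finding that remaining tasks must run in the cloud.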

A multi-threaded programming language and compiler generate synchronous digital circuits that maintain thread execution order by generating pipelines with code paths that have the same number of stages. The compiler balances related code paths within a pipeline by adding additional stages to a code path that has fewer stages. Programming constructs that, by design, allow thread execution to be re-ordered may be placed in a reorder block construct that releases threads in the order they entered the programming construct. First-in-first-out (FIFO) queues pass local variables between pipelines. Local variables are popped from FIFOs in the order they were pushed, preserving thread execution order across pipelines.

Language and compiler that generate synchronous digital circuits that maintain thread execution order
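
The path-balancing step can be illustrated with a small sketch. It assumes a hypothetical representation in which each code path is a list of stage names and a pass-through stage is written as "nop"; the actual compiler's internal representation is not described in the abstract.

```python
def balance_paths(paths):
    """Pad shorter code paths with pass-through stages so every path
    in the pipeline has the same number of stages."""
    depth = max(len(stages) for stages in paths.values())
    return {
        name: stages + ["nop"] * (depth - len(stages))
        for name, stages in paths.items()
    }


# Two branches of a conditional with different stage counts.
paths = {"then": ["mul", "add", "store"], "else": ["store"]}
balanced = balance_paths(paths)
print(balanced["else"])  # ['store', 'nop', 'nop']
```

With equal depths, a thread that takes the short branch leaves the pipeline on the same cycle it would have on the long branch, so a later thread cannot overtake an earlier one.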

Disclosed is a data accelerated processing system including a processing device, a storage device, an interface device and a control device. The processing device is configured to realize accelerated operation processing of data. The storage device is electrically connected to the processing device for storing the data sent by a server. The interface device is electrically connected to the processing device for data transmission between the processing device and the server. The control device is configured to regulate the status of the processing device. During an operation process, a large number of operating tasks in the server may be transmitted to the processing device for operating through the interface device, and large amounts of buffered data may be stored in the storage device. The data accelerated processing system improves data reading speed and operation efficiency through the cooperation of the processing device, the storage device and the interface device.

Data accelerated processing system

An example preprocessor circuit for formatting image data into a plurality of streams of image samples includes: a first buffer configured to store a plurality of rows of the image data and output a row of the plurality of rows; a second buffer, coupled to the first buffer, including a plurality of storage locations to store a respective plurality of image samples of the row output by the first buffer; a plurality of shift registers; an interconnect network including a plurality of connections, each connection coupling a respective one of the plurality of shift registers to more than one of the plurality of storage locations, one or more of the plurality of storage locations being coupled to more than one of the plurality of connections; and a control circuit configured to load the plurality of shift registers with the plurality of image samples based on the plurality of connections and shift the plurality of shift registers to output the plurality of streams of image samples.

Image preprocessing for generalized image processing

Embodiments herein describe techniques for interfacing a neural network application (120) with a neural network accelerator (165) using a library (130). The neural network application (120) may execute on a host computing system (105) while the neural network accelerator (165) executes on a massively parallel hardware system, e.g., a FPGA (150). The library (130) operates a pipeline (500) for submitting the tasks received from the neural network application (120) to the neural network accelerator (165). In one embodiment, the pipeline (500) includes a pre-processing stage, an FPGA execution stage, and a post-processing stage (135) which each correspond to different threads. When receiving a task from the neural network application (120), the library (130) generates a packet (410) that includes the information required for the different stages in the pipeline to perform the tasks. Because the stages correspond to different threads (415), the library (130) can process multiple packets in parallel which can increase the utilization of the neural network accelerator (165) on the hardware system.

Machine learning runtime library for neural network acceleration
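
The three-stage pipeline in this abstract maps naturally onto worker threads connected by queues. The sketch below is an illustration under assumptions, not the library's API: the stage functions, the dictionary used as a packet, and the use of `None` as a shutdown marker are all hypothetical, and `execute` is a stand-in for the accelerator call.

```python
import queue
import threading


def run_stage(work, inbox, outbox):
    """Worker loop for one pipeline stage, run in its own thread."""
    while True:
        packet = inbox.get()
        if packet is None:          # shutdown marker: pass it on and exit
            outbox.put(None)
            return
        outbox.put(work(packet))


def run_pipeline(tasks, stages):
    """Push one packet per task through the stages; each stage has its own thread."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for task in tasks:
        queues[0].put({"input": task})   # the packet carries data for all stages
    queues[0].put(None)
    results = []
    while (packet := queues[-1].get()) is not None:
        results.append(packet)
    for t in threads:
        t.join()
    return results


def preprocess(p):
    return {**p, "tensor": [p["input"]] * 2}

def execute(p):                      # stand-in for the accelerator stage
    return {**p, "raw": sum(p["tensor"])}

def postprocess(p):
    return {**p, "result": p["raw"] + 1}


results = run_pipeline([1, 2, 3], [preprocess, execute, postprocess])
print([p["result"] for p in results])  # [3, 5, 7]
```

Because each stage runs in its own thread, the first stage can prepare the next packet while the second stage is still busy with the previous one, which is the overlap the abstract credits with raising accelerator utilization.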

Methods and apparatus are described for partitioning and reordering block-based matrix multiplications for high-speed data streaming in general matrix multiplication (GEMM), which may be implemented by a programmable integrated circuit (IC). By preloading and hierarchically caching the blocks, examples of the present disclosure reduce the double data rate (DDR) memory intake bandwidth for software-defined GEMM accelerators.

Software-defined memory bandwidth reduction by hierarchical stream buffering for general matrix multiplication in a programmable IC
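
To make the bandwidth argument concrete, the following sketch multiplies two matrices block by block and counts how often each block is fetched. It is a simplified illustration, not the hierarchical stream buffering scheme of the disclosure: the loop order, the single-level reuse of the A block, and the `loads` counter are assumptions chosen to show how block reuse reduces memory traffic.

```python
def blocked_matmul(a, b, block):
    """Block-based matrix multiply that counts block fetches.

    An A block is fetched once and reused for every column block of B,
    standing in for a block held in an on-chip buffer.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    loads = {"a": 0, "b": 0}
    for i0 in range(0, n, block):
        for k0 in range(0, k, block):
            a_blk = [row[k0:k0 + block] for row in a[i0:i0 + block]]
            loads["a"] += 1                       # one fetch, many uses
            for j0 in range(0, m, block):
                b_blk = [row[j0:j0 + block] for row in b[k0:k0 + block]]
                loads["b"] += 1
                # Accumulate this block product into the output tile.
                for i, a_row in enumerate(a_blk):
                    for kk, a_val in enumerate(a_row):
                        for j, b_val in enumerate(b_blk[kk]):
                            c[i0 + i][j0 + j] += a_val * b_val
    return c, loads


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[1 if r == col else 0 for col in range(4)] for r in range(4)]
c, loads = blocked_matmul(a, identity, block=2)
print(c == a, loads)  # True {'a': 4, 'b': 8}
```

Here each A block is fetched 4 times in total instead of 8 because it stays resident across the inner loop; caching B blocks at a second level would cut the remaining fetches further, which is the kind of saving the abstract attributes to hierarchical caching.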

Embodiments herein describe techniques for statically scheduling a neural network (100) implemented in a massively parallel hardware system (205). The neural network (100) may be scheduled using three different scheduling levels referred to herein as an upper level, an intermediate level, and a lower level. In one embodiment, the upper level includes a hardware or software model (400) of the layers in the neural network (100) that establishes a sequential order of functions that operate concurrently in the hardware system (205). In the intermediate level, identical processes in the functions defined in the upper level are connected to form a systolic array (280) or mesh, and balanced data flow channels are used to minimize latency. In the lower level, a compiler (265) can assign the operations performed by the processing elements in the systolic array to different portions of the hardware system (205) to provide a static schedule for the neural network (100).

Static block scheduling in massively parallel software defined hardware systems

Methods and apparatus are described for performing data-intensive compute algorithms, such as fast massively parallel general matrix multiplication (GEMM), using a particular data format for both storing data to and reading data from memory. This data format may be utilized for arbitrarily-sized input matrices for GEMM implemented on a finite-size GEMM accelerator in the form of a rectangular compute array of digital signal processing (DSP) elements or similar compute cores. This data format solves the issue of double data rate (DDR) dynamic random access memory (DRAM) bandwidth by allowing both linear DDR addressing and single cycle loading of data into the compute array, avoiding input/output (I/O) and/or DDR bottlenecks.

Zejda Jindrich

Papers

Image preprocessing for generalized image processing

Machine learning runtime library for neural network acceleration

Software-defined memory bandwidth reduction by hierarchical stream buffering for general matrix multiplication in a programmable IC

Static block scheduling in massively parallel software defined hardware systems

Data format suitable for fast massively parallel general matrix multiplication in a programmable IC