Due to the emerging deep learning technology, deep learning tasks evolve toward complex control logic and frequent host-device interaction. However, the conventional CPU-centric heterogenous system has high host-device interaction overhead and slow improvement in interaction speed. Eventually, this leads to the problem of “interaction wall”, a gap between the improvement of interaction speed and the improvement of device computing speed. According to Amdahl’s law, this would severely limit the application of accelerators.

Addressing this problem, the CPULESS accelerator proposes a fused pipeline structure, which makes deep learning processor the center of the system, eliminating the need for an discrete host CPU chip. With the exception-oriented programming, the DLPU-centric system can combine the scalar control unit and the vector operation unit, and interaction overhead in between is minimized.

Experiments show that on a variety of emerging deep learning tasks with complex control logic, the CPULESS system can achieve 10.30x speedup and save 92.99% of the energy consumption compared to the conventional CPU-centric discrete GPU system.

Published in “IEEE Transactions on Computers”. [DOI] [Artifact]