目录
  1. MNIST-style Pageview Counter
  2. Cambricon-Q: A Hybrid Architecture for Efficient Training
  3. Cambricon-FR: Fractal Reconfigurable ISA Machines (Universal Fractal Machines)
  4. Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture

MNIST-style Pageview Counter

This is the pageview counter designed for this site, i.e., the one in the footer. I estimate that 99% of people nowadays reach for Python when dealing with MNIST. But this site promises not to use Python, so this gadget is written in C++. The overall experience is not much more complicated than Python, and it should save a lot of carbon 😁.

Implementation

Download the MNIST test set and uncompress it with gzip. LeCun said the first 5k images are easier, so the program uses only the first 5k images. The counter is persisted to a file, fcounter.db. Digits are randomly chosen from the test images, assembled into one PNG image (with Magick++, which is easier), and returned to the webserver via FastCGI.
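For reference, fetching and unpacking the two test-set files might look like this sketch (assuming the canonical filenames from LeCun's MNIST page; a mirror works equally well):

wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
gunzip t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz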

Code
counter.cpp
#include <iostream>
#include <fstream>
#include <cstring>
#include <random>
#include <map>
#include <array>   // std::array (was missing)
#include <vector>  // std::vector (was missing)
#include <Magick++.h>
#include <fcgio.h>

std::ifstream t10k_images("./t10k-images-idx3-ubyte", std::ios_base::binary | std::ios_base::in);
std::ifstream t10k_labels("./t10k-labels-idx1-ubyte", std::ios_base::binary | std::ios_base::in);
size_t count = 0;

enum {
    IMAGE_IN = 16,     // IDX image file: 16-byte header
    IMAGE_NUM = 5000,  // only the first 5k (easier) test images
    IMAGE_SIZE = 28,
    IMAGE_BYTE = IMAGE_SIZE * IMAGE_SIZE,
    LABEL_IN = 8,      // IDX label file: 8-byte header
};

std::multimap<int, size_t> categories;  // label -> file offset of each image
std::random_device rd;
std::mt19937 mt(rd());

// Index the images by label and restore the persisted counter.
void init() {
    t10k_labels.seekg(LABEL_IN);
    for (size_t i = 0; i < IMAGE_NUM; i++) {
        unsigned char c;
        t10k_labels.read(reinterpret_cast<char*>(&c), 1);
        categories.insert({c, i * IMAGE_BYTE + IMAGE_IN});
    }
    std::ifstream fcounter("./fcounter.db", std::ios_base::binary | std::ios_base::in);
    fcounter.read(reinterpret_cast<char*>(&count), sizeof(count));
    fcounter.close();
}

// Read a random test image of digit c into img.
void select(std::array<unsigned char, IMAGE_BYTE>& img, unsigned char c) {
    auto range = categories.equal_range(c);
    auto first = range.first; auto last = range.second;
    auto n = std::distance(first, last);
    std::uniform_int_distribution<> dist(0, n - 1);
    auto sk = std::next(first, dist(mt))->second;
    t10k_images.seekg(sk);
    t10k_images.read(reinterpret_cast<char*>(img.data()), IMAGE_BYTE);
}

void hit(std::ostream& os) {
    // Bump and persist the counter.
    count++;
    std::ofstream fcounter("./fcounter.db", std::ios_base::binary | std::ios_base::out);
    fcounter.write(reinterpret_cast<char*>(&count), sizeof(count));
    fcounter.close();
    // Zero-pad to at least 6 digits.
    std::string str = std::to_string(count);
    if (str.length() < 6)
        str = std::string(6 - str.length(), '0') + str;
    // Paste one random MNIST digit per character onto a grayscale canvas.
    size_t w = IMAGE_SIZE * str.length(), h = IMAGE_SIZE;
    std::vector<unsigned char> canvas(w * h, 0);
    size_t i = 0;
    for (auto&& c : str) {
        std::array<unsigned char, IMAGE_BYTE> img;
        select(img, c - '0');
        for (int y = 0; y < IMAGE_SIZE; y++) {
            std::memcpy(&canvas[y * w + i * IMAGE_SIZE], &img[y * IMAGE_SIZE], IMAGE_SIZE);
        }
        i++;
    }
    // Encode the canvas as PNG and emit it as a FastCGI response.
    Magick::Image image(w, h, "I", Magick::CharPixel, canvas.data());
    Magick::Blob blob;
    image.type(Magick::GrayscaleType);
    image.magick("PNG");
    image.write(&blob);
    os << "Content-Type: image/png\r\n";
    os << "Content-Length: " << blob.length() << "\r\n\r\n";
    os.write(reinterpret_cast<const char*>(blob.data()), blob.length()) << std::flush;
}

int main() {
    Magick::InitializeMagick(nullptr);  // required before any Magick++ operation
    FCGX_Request request;
    init();
    FCGX_Init();
    FCGX_InitRequest(&request, 0, 0);
    while (FCGX_Accept_r(&request) == 0) {
        fcgi_streambuf osbuf(request.out);
        std::ostream os(&osbuf);
        hit(os);
    }
    return 0;
}

The code is now placed in the public domain.

Compilation

  • Install the APT packages libmagick++-dev and libfcgi-dev.
  • For the FastCGI C++ wrapper, the linker flags -lfcgi++ -lfcgi are required.
  • CMake can find Magick++ automatically; otherwise, append the compiler flags reported by magick++-config.
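Without CMake, a one-line build might look like the following sketch (the config script may be installed as Magick++-config or magick++-config depending on the distribution, and the output name counter is arbitrary):

g++ -std=c++17 -O2 counter.cpp -o counter $(Magick++-config --cppflags --cxxflags --ldflags --libs) -lfcgi++ -lfcgi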

Deployment

I use spawn-fcgi to spawn the compiled binary (set up as a systemd service). Most webservers support FastCGI, so a reverse proxy to the port opened by spawn-fcgi completes the deployment. I use Caddy:

reverse_proxy /counter.png localhost:21930 {
    transport fastcgi
}
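For completeness, a matching spawn-fcgi invocation might look like this sketch (the install path /srv/counter and the binary name counter are placeholders; -d sets the working directory so the relative paths to the MNIST files and fcounter.db resolve):

spawn-fcgi -a 127.0.0.1 -p 21930 -d /srv/counter -- /srv/counter/counter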

Reference /counter.png from the footer HTML (e.g., with an <img> tag), and the digits bump on every refresh. Following internal links will not bump the counter if the browser has cached the image.

Cambricon-Q: A Hybrid Architecture for Efficient Training

To pursue energy efficiency, most deep learning accelerators use 8-bit or even lower-bit-width computing units, especially on mobile platforms. With special techniques, such low-bit-width accelerators can meet the accuracy requirements of inference tasks, but they cannot be used for training, because training is far more numerically sensitive. How can the architecture be extended to enable efficient training on mobile devices?

In response to this problem, we proposed Cambricon-Q.

Cambricon-Q introduces three new modules:

  • the SQU supports on-the-fly statistics and quantization;
  • the QBC manages mixed precision and data formats for the on-chip buffers;
  • the NDPO performs the weight-update process near memory.

The proposed architecture can support a variety of quantization-aware training algorithms. Experiments show that Cambricon-Q achieves efficient training with negligible accuracy loss.

Published at ISCA 2021. [DOI]

Cambricon-FR: Fractal Reconfigurable ISA Machines (Universal Fractal Machines)

This work follows Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture.

Cambricon-F obtains the programming scale-invariance property via fractal execution, alleviating the programming-productivity issue of machine learning computers. However, fractal execution on that machine is driven by the hardware controller and supports only a few common basic operators (convolution, pooling, etc.); other functions must be built as sequences of these operators. We found that when a limited, fixed instruction set is used to support complex and variable application workloads, inefficiency occurs.

On regular algorithms such as conventional CNNs, the machine achieves optimal efficiency. In complex and variable application scenarios, however, inefficiency arises even when the application itself conforms to the definition of a fractal operation. We define this inefficiency as suboptimal computational or communication complexity when certain applications are executed on a fractal computer; the paper uses TopK and 3D convolution to illustrate it.

An intuitive example: a user wants to run a Bayesian network, which conforms to the definition of a fractal operation and could be executed efficiently in a fractal manner. But because Cambricon-F has no “Bayesian” instruction, the application can only be decomposed into a series of basic operators and then executed serially. If the instruction set could be extended with a BAYES fractal instruction, fractal execution could be maintained all the way down to the leaf nodes, significantly improving computational efficiency.

Based on this, we improved the Cambricon-F architecture and proposed Cambricon-FR, featuring a fractal reconfigurable instruction set architecture. Analytically, Cambricon-F is a Fractal Machine, while Cambricon-FR can be seen as a Universal Fractal Machine: Cambricon-F achieves optimal efficiency on a specific application workload, while Cambricon-FR achieves optimal efficiency on complex and variable application workloads.

Published in “IEEE Transactions on Computers”. [DOI]

Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture

While working as a software architect at Cambricon Tech, I experienced the pain points of software engineering firsthand. When I took over in 2016, the core software was developed by me and WANG Yuqing, totaling 15,000 lines of code; when I left in 2018, the development team had grown to more than 60 people and 720,000 lines of code. Measured in lines, the complexity of the software doubled every 5 months. No matter how much manpower was added, the team remained under tremendous development pressure: customer needs were urgent and had to be handled immediately; new features needed to be developed while the accumulated old code needed refactoring; the documentation had not yet been established; neither had the tests…

I may not be a professional software architect, but who can guarantee that all future changes are foreseen from the very beginning? Just imagine: the underlying hardware was single-core; a year later it became multi-core; another year later, NUMA. With such rapid evolution, how could the same software keep up without thorough refactoring? The key to the problem is that as the scale of the hardware grows, the number of abstraction levels that must be programmed and controlled also grows, making programming ever more complicated. We define this problem as programming scale-variance.

To solve this problem arising from engineering practice, we started the research that became Cambricon-F.

To address programming scale-variance, it is necessary to introduce some kind of scale invariant. The invariant we found is the fractal: geometric fractals are self-similar at different scales. We define the workload in a fractal manner, and the hardware architecture likewise. Both can then be zoomed freely until mutually compatible scales are found.

Cambricon-F first proposed the Fractal von Neumann Architecture. The key features of this architecture are:

  • Sequential code, parallel execution, adapted to the hardware scale automatically;
  • Programming scale-invariance: the hardware scale is not encoded in the program, so code transfers freely between different Cambricon-F instances;
  • High efficiency, retained through fractal pipelining.

Published at ISCA 2019. [DOI]