Contents
  1. Hello World
  2. Automated Placement and Routing Tool for Standard CMOS Circuit Schematics
  3. Cambricon-C: Efficient 4-Bit Matrix Unit via Primitivization
  4. The Application of Skew Ternary Numbers in Gomoku AI Programs
  5. Thoughts on PIM after Cambricon-U
  6. Cambricon-U: A Systolic Random Increment Memory
  7. Yongwei's ten rules on thesis writing
  8. Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing
  9. CCF Chips 2022: The Post Golden Age of Computer Architectures, What's Next to the Deep Learning Processors?
  10. MPMD decides Halting

Hello World

You can switch to Chinese at the top right.

This is a Python-free site.

Automated Placement and Routing Tool for Standard CMOS Circuit Schematics

(ZH-EN Translation by DeepSeek-V3-0324)

Background

Recently, the national key lab planned to start work on intelligent circuit design, and while preparing materials we often needed circuit schematics as examples. There was, however, no readily available tool to draw them conveniently. Existing schematic editors require placing every component and connecting every wire by hand, so even a small-scale example is tedious: a 4-bit adder already contains hundreds of transistors, and manually placing anything larger approaches the effort of designing a 4004 processor from scratch. Real synthesis and place-and-route (PnR) tools, meanwhile, are cumbersome to invoke and produce visually unappealing diagrams. As a result, I have always ended up writing my own scripts to generate these figures.

Initially, I used Python with PIL (Python Imaging Library) to draw the schematics. I wrote a few basic functions to draw transistors and logic gates at specified positions, but the placement and routing still had to be designed manually, resulting in low efficiency. I then decided to add simple placement and routing logic to the script to automate the entire workflow from input circuit netlists to output schematics. Here is the final result:

For computationally intensive tasks like placement and routing, I abandoned Python for C++ right away, to avoid waiting an entire day for a single diagram. The router reuses a maze-routing implementation I had written for another project; I pasted that code into DeepSeek and asked it to complete the placement code. DeepSeek-V3-0324 showed impressive coding ability, producing complete, functional code that worked right away. With the model's help, plus a bit of research and a few clever ideas, dropping PIL was not painful either: DeepSeek provided the PNG output code, extracted the BDF bitmap font, and helped me track down a bug in the diagonal-line rasterization.
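The idea behind maze routing is easy to sketch: expand a BFS wavefront from one endpoint across free grid cells, then backtrace along decreasing wave marks. Below is a minimal self-contained sketch of that idea, my own simplification rather than the tool's actual router (which also handles obstacle masks, direction-change penalties, and reuse of already-routed nets):

```cpp
#include <array>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Minimal Lee maze router: BFS wave expansion on a grid with obstacles,
// then backtrace from target to source along decreasing wave marks.
// Returns the path (target..source) or an empty vector if unreachable.
std::vector<std::pair<int,int>>
maze_route(const std::vector<std::vector<int>>& blocked,
           std::pair<int,int> src, std::pair<int,int> dst) {
    int H = blocked.size(), W = blocked[0].size();
    std::vector<std::vector<int>> mark(H, std::vector<int>(W, 0));
    std::queue<std::pair<int,int>> q;
    mark[src.first][src.second] = 1;           // wave 1 at the source
    q.push(src);
    constexpr std::array<std::pair<int,int>,4> dirs{{{1,0},{-1,0},{0,1},{0,-1}}};
    while (!q.empty()) {
        auto [y, x] = q.front(); q.pop();
        if (y == dst.first && x == dst.second) break;
        for (auto [dy, dx] : dirs) {
            int ny = y + dy, nx = x + dx;
            if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue;
            if (blocked[ny][nx] || mark[ny][nx]) continue;
            mark[ny][nx] = mark[y][x] + 1;     // expand the wavefront
            q.push({ny, nx});
        }
    }
    if (!mark[dst.first][dst.second]) return {};  // unreachable
    std::vector<std::pair<int,int>> path{dst};    // backtrace
    auto [y, x] = dst;
    while (mark[y][x] > 1)
        for (auto [dy, dx] : dirs) {
            int ny = y + dy, nx = x + dx;
            if (ny >= 0 && ny < H && nx >= 0 && nx < W &&
                mark[ny][nx] == mark[y][x] - 1) {
                y = ny; x = nx; path.push_back({y, x});
                break;
            }
        }
    return path;
}
```

Lee's algorithm guarantees a shortest path whenever one exists, at the cost of marking the whole grid for every net, which is why abandoning Python for this inner loop pays off.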

Thus, I quickly implemented this ~800-line automated circuit placement and routing program that takes BLIF as input and outputs PNGs. It is entirely self-contained; its only dependency is the C++ standard library. Even for someone who has been programming since childhood, this was an entirely new coding experience, and it shows how large language models are reshaping programming. Without them, reading the specifications for PNG, BDF, and DEFLATE and writing all the low-level routines would have been painful enough that I would never have attempted it.
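For reference, the BLIF subset the parser consumes is tiny: `.inputs`, `.outputs`, and `.subckt` lines, where the last `port=net` pair on each `.subckt` line is taken as the gate's output. A hypothetical two-inverter netlist (net and port names invented for illustration; other lines such as `.model` are ignored by the parser):

```
.model buf2
.inputs a
.outputs y
.subckt NOT A=a Y=n1
.subckt NOT A=n1 Y=y
.end
```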

Abandoning Python and returning to C++ has never been easier, and presumably, the same applies to abandoning CUDA. In this era, ecosystem barriers are being weakened across the board—an exciting trend!

Code

The code uses C++26. Currently, only GCC 15.0 can compile it, but since 15.0 has not yet been released, I compiled the latest GCC from source.

This is my first extensive use of the ranges library, and the experience has been great—it often saves a lot of unnecessary verbosity. The current ranges implementation still lacks some features, such as a reverse drop (removing elements from the end of a range). It’s also inconvenient to determine whether a loop iteration is the first or last, making the old-fashioned way of writing loops sometimes preferable.

The C++ standard also lacks some key containers, such as a trie—why hasn’t this been added yet? The performance of some older containers has long been criticized, and it’s time for a refresh. The latest standard provides flat_map, but in my testing, it showed no performance advantage in my use case. I didn’t see an unordered_flat_map. Hopefully, these new features will avoid repeating the mistakes of regex.
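For what it's worth, the interning job that the parser below delegates to a `std::map` (the `name_ids` table) is exactly what a trie would cover. A minimal hand-rolled interning trie over ASCII, as a sketch of what such a standard container might look like (names and layout here are my own):

```cpp
#include <array>
#include <string>
#include <vector>

// Minimal interning trie: maps each distinct string to a small integer id,
// assigned in first-seen order (the same contract as a map<string,int>
// keyed by insertion order). Assumes 7-bit ASCII input.
struct trie {
    struct node { std::array<int,128> next{}; int id = -1; };
    std::vector<node> nodes{1};  // node 0 is the root
    int count = 0;
    int intern(const std::string& s) {
        int cur = 0;
        for (unsigned char c : s) {
            if (!nodes[cur].next[c]) {              // 0 means "no edge yet"
                nodes[cur].next[c] = (int)nodes.size();
                nodes.emplace_back();
            }
            cur = nodes[cur].next[c];
        }
        if (nodes[cur].id < 0) nodes[cur].id = count++;
        return nodes[cur].id;
    }
};
```

Lookups cost O(length) with no hashing and no comparisons against whole keys, which is the usual argument for wanting one in the standard.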

cirschem.cpp
/* ============================================================================
place and route a standard CMOS circuit schematic.
input: Berkeley Logic Interchange Format (BLIF)
output: black-and-white schematic image (PNG)

compile with: g++ cirschem.cpp -std=c++26 -O3
requires C++26 (GCC 15 or above)

Copyright (c) 2025 zhaoyongwei<zhaoyongwei@ict.ac.cn>
IPRC, ICT CAS. Visit https://yongwei.site/pnr-circuit-schematic

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

* ============================================================================ */

#include <iostream>
#include <format>
#include <fstream>
#include <string>
#include <vector>
#include <deque>
#include <array>
#include <tuple>
#include <set>
#include <map>
#include <unordered_map>
#include <utility>
#include <algorithm>
#include <ranges>
#include <random>
#include <cmath>
#include <cstdint>
#include <bit>        // std::bit_cast, std::byteswap, std::endian
#include <functional> // std::bit_not
#include <iterator>   // std::back_inserter
#include <limits>     // std::numeric_limits
#include <chrono>
using namespace std::chrono_literals;
using namespace std::literals;

// simulated annealing placement parameters
constexpr double initial_temperature = std::exp(5);
constexpr double cooling_rate = 0.99999;
constexpr double stop_temperature = 0.1;
constexpr auto step_timeout = 5ms;
// placement optimizing target
constexpr double initial_area_margins = std::exp(1);
constexpr int area_cost = 5;
constexpr int cross_track_penalty = 50;
// output file name
constexpr auto schematic_filename = "schematic.png";

enum {
NOT = 0,
NAND = 1,
};

std::tuple<std::set<int>,
std::set<int>,
std::unordered_map<int, std::pair<int, std::vector<int>>>,
std::vector<std::string>>
parse(int argc, char* argv[]) {

std::set<int> inputs;
std::set<int> outputs;
std::unordered_map<int, std::pair<int, std::vector<int>>> gates;
std::vector<std::string> names {"NOT"s, "NAND"s};
std::map<std::string, int> name_ids {{"NOT"s, 0}, {"NAND"s, 1}}; // compromise stand-in for a trie.
auto parse_name = [&](std::string str) {
if (!name_ids.contains(str)) {
name_ids.emplace(str, names.size());
names.push_back(str);
}
return name_ids.at(str);
};

if (argc < 2) {
std::cerr << "Usage: " << argv[0] << " <blif_file>\n";
return {inputs, outputs, gates, names};
}

std::ifstream blif(argv[1]);
if (!blif.is_open()) {
std::cerr << "Error opening file: " << argv[1] << '\n';
return {inputs, outputs, gates, names};
}

std::string line;
while (std::getline(blif, line)) {
if (line.starts_with(".inputs")) {
size_t start = 7; // length of ".inputs"
while (start < line.size()) {
size_t end = line.find(' ', start);
if (end == std::string::npos) end = line.size();
std::string token = line.substr(start, end - start);
if (!token.empty()) {
inputs.insert(parse_name(token));
}
start = end + 1;
}
} else if (line.starts_with(".outputs")) {
size_t start = 8; // length of ".outputs"
while (start < line.size()) {
size_t end = line.find(' ', start);
if (end == std::string::npos) end = line.size();
std::string token = line.substr(start, end - start);
if (!token.empty()) {
outputs.insert(parse_name(token));
}
start = end + 1;
}
} else if (line.starts_with(".subckt")) {
std::vector<int> ports;
size_t start = 8; // length of ".subckt"
while (start < line.size()) {
size_t end = line.find(' ', start);
if (end == std::string::npos) end = line.size();
std::string token = line.substr(start, end - start);
if (!token.empty()) {
size_t start = token.find('=', 0);
if (start != std::string::npos) {
token = token.substr(start+1, token.size());
}
ports.push_back(parse_name(token));
}
start = end + 1;
}
int gate = ports.front();
int name = ports.back();
ports.erase(ports.begin());
ports.pop_back();
gates[name] = { gate, ports };
}
}

return {inputs, outputs, gates, names};
}

template<typename T, T... Ints>
constexpr auto make_crc_table(std::integer_sequence<T, Ints...> is) {
return std::array<T, sizeof...(Ints)>{ [](T n){
for (int k : std::views::iota(0,8)) n = (n>>1)^(n&1?0xedb88320:0); return n;
}(Ints)...};
}
constexpr auto crc_table{ make_crc_table(std::make_integer_sequence<uint32_t, 256>{}) };

template<typename T>
constexpr uint32_t crc(const T& buf) {
return ~std::ranges::fold_left(buf, 0xffffffff, [](uint32_t c, uint8_t d){
return crc_table[(c ^ d) & 0xff] ^ (c>>8);
});
}

template<typename T>
constexpr uint32_t adler32(const T& buf) {
return std::ranges::fold_left(buf, 1, [](uint32_t ab, uint8_t d){
auto a = ((ab & 0xffff) + d) % 65521; return a | ((((ab >> 16) + a) % 65521) << 16);
});
}

template<typename Iter, typename T>
void insert_be(Iter it, T value) {
auto value_be = std::endian::native == std::endian::little ? std::byteswap(value) : value;
auto arr = std::bit_cast<std::array<uint8_t,sizeof(T)>>(value_be);
std::ranges::copy(arr, it);
}
template<typename Iter, typename T>
void insert_le(Iter it, T value) {
auto value_le = std::endian::native == std::endian::big ? std::byteswap(value) : value;
auto arr = std::bit_cast<std::array<uint8_t,sizeof(T)>>(value_le);
std::ranges::copy(arr, it);
}

std::vector<uint8_t> deflate(const std::vector<uint8_t>& data) {
std::vector<uint8_t> compressed_data;
auto it = std::back_inserter(compressed_data);
for (auto chunk : data | std::views::chunk(0xfffb)) {
*it = 0;
insert_le(it, static_cast<uint16_t>( chunk.size()));
insert_le(it, static_cast<uint16_t>(~chunk.size()));
std::ranges::copy(chunk | std::views::transform(std::bit_not<uint8_t>{}), it);
}
compressed_data[(compressed_data.size()-1) & ~0xffffULL] = 0x1;
return compressed_data;
}

constexpr std::array<std::tuple<int,std::array<uint16_t,8>>,96> font {{
{ 0, {0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000,0x0000}}, //
{ 0, {0x0000,0x0000,0x0380,0x6FC0,0x6FC0,0x0380,0x0000,0x0000}}, // !
{ 6, {0x0000,0x3800,0x7800,0x0000,0x0000,0x7800,0x3800,0x0000}}, // "
{ 0, {0x1100,0x7FC0,0x7FC0,0x1100,0x7FC0,0x7FC0,0x1100,0x0000}}, // #
{-2, {0x0CE0,0x19F0,0x1110,0x711C,0x711C,0x1F30,0x0E60,0x0000}}, // $
{ 0, {0x4300,0x6300,0x3000,0x1800,0x0C00,0x6600,0x6300,0x0000}}, // %
{ 0, {0x3800,0x7D80,0x47C0,0x4E40,0x3BC0,0x7D80,0x4400,0x0000}}, // &
{ 6, {0x0000,0x4000,0x7800,0x3800,0x0000,0x0000,0x0000,0x0000}}, // '
{ 0, {0x0000,0x0000,0x1F00,0x3F80,0x60C0,0x4040,0x0000,0x0000}}, // (
{ 0, {0x0000,0x0000,0x4040,0x60C0,0x3F80,0x1F00,0x0000,0x0000}}, // )
{ 2, {0x1000,0x5400,0x7C00,0x3800,0x3800,0x7C00,0x5400,0x1000}}, // *
{ 2, {0x0000,0x1000,0x1000,0x7C00,0x7C00,0x1000,0x1000,0x0000}}, // +
{-1, {0x0000,0x0000,0x4000,0x7800,0x3800,0x0000,0x0000,0x0000}}, // ,
{ 4, {0x4000,0x4000,0x4000,0x4000,0x4000,0x4000,0x4000,0x0000}}, // -
{ 0, {0x0000,0x0000,0x0000,0x6000,0x6000,0x0000,0x0000,0x0000}}, // .
{ 1, {0x6000,0x3000,0x1800,0x0C00,0x0600,0x0300,0x0180,0x0000}}, // /
{ 0, {0x3F80,0x7FC0,0x4C40,0x4640,0x4340,0x7FC0,0x3F80,0x0000}}, // 0
{ 0, {0x0000,0x4100,0x4180,0x7FC0,0x7FC0,0x4000,0x4000,0x0000}}, // 1
{ 0, {0x6080,0x70C0,0x5840,0x4C40,0x4640,0x63C0,0x6180,0x0000}}, // 2
{ 0, {0x2080,0x60C0,0x4440,0x4440,0x4440,0x7FC0,0x3B80,0x0000}}, // 3
{ 0, {0x0C00,0x0E00,0x0B00,0x4980,0x7FC0,0x7FC0,0x4800,0x0000}}, // 4
{ 0, {0x27C0,0x67C0,0x4440,0x4440,0x4440,0x7C40,0x3840,0x0000}}, // 5
{ 0, {0x3F00,0x7F80,0x44C0,0x4440,0x4440,0x7C00,0x3800,0x0000}}, // 6
{ 0, {0x00C0,0x00C0,0x7840,0x7C40,0x0640,0x03C0,0x01C0,0x0000}}, // 7
{ 0, {0x3B80,0x7FC0,0x4440,0x4440,0x4440,0x7FC0,0x3B80,0x0000}}, // 8
{ 0, {0x0380,0x47C0,0x4440,0x4440,0x6440,0x3FC0,0x1F80,0x0000}}, // 9
{ 1, {0x0000,0x0000,0x0000,0x6300,0x6300,0x0000,0x0000,0x0000}}, // :
{ 0, {0x0000,0x0000,0x4000,0x7180,0x3180,0x0000,0x0000,0x0000}}, // ;
{ 0, {0x0000,0x0400,0x0E00,0x1B00,0x3180,0x60C0,0x4040,0x0000}}, // <
{ 2, {0x0000,0x4800,0x4800,0x4800,0x4800,0x4800,0x4800,0x0000}}, // =
{ 0, {0x0000,0x4040,0x60C0,0x3180,0x1B00,0x0E00,0x0400,0x0000}}, // >
{ 0, {0x0180,0x01C0,0x0040,0x6C40,0x6E40,0x03C0,0x0180,0x0000}}, // ?
{ 0, {0x3F80,0x7FC0,0x4040,0x5E40,0x5E40,0x5FC0,0x0F80,0x0000}}, // @
{ 0, {0x7E00,0x7F00,0x0980,0x08C0,0x0980,0x7F00,0x7E00,0x0000}}, // A
{ 0, {0x4040,0x7FC0,0x7FC0,0x4440,0x4440,0x7FC0,0x3B80,0x0000}}, // B
{ 0, {0x1F00,0x3F80,0x60C0,0x4040,0x4040,0x60C0,0x3180,0x0000}}, // C
{ 0, {0x4040,0x7FC0,0x7FC0,0x4040,0x60C0,0x3F80,0x1F00,0x0000}}, // D
{ 0, {0x4040,0x7FC0,0x7FC0,0x4440,0x4E40,0x60C0,0x71C0,0x0000}}, // E
{ 0, {0x4040,0x7FC0,0x7FC0,0x4440,0x0E40,0x00C0,0x01C0,0x0000}}, // F
{ 0, {0x1F00,0x3F80,0x60C0,0x4840,0x4840,0x38C0,0x7980,0x0000}}, // G
{ 0, {0x7FC0,0x7FC0,0x0400,0x0400,0x0400,0x7FC0,0x7FC0,0x0000}}, // H
{ 0, {0x0000,0x0000,0x4040,0x7FC0,0x7FC0,0x4040,0x0000,0x0000}}, // I
{ 0, {0x3000,0x7000,0x4000,0x4040,0x7FC0,0x3FC0,0x0040,0x0000}}, // J
{ 0, {0x4040,0x7FC0,0x7FC0,0x0400,0x1F00,0x7BC0,0x60C0,0x0000}}, // K
{ 0, {0x4040,0x7FC0,0x7FC0,0x4040,0x4000,0x6000,0x7000,0x0000}}, // L
{ 0, {0x7FC0,0x7FC0,0x0380,0x0700,0x0380,0x7FC0,0x7FC0,0x0000}}, // M
{ 0, {0x7FC0,0x7FC0,0x0380,0x0700,0x0E00,0x7FC0,0x7FC0,0x0000}}, // N
{ 0, {0x1F00,0x3F80,0x60C0,0x4040,0x60C0,0x3F80,0x1F00,0x0000}}, // O
{ 0, {0x4040,0x7FC0,0x7FC0,0x4440,0x0440,0x07C0,0x0380,0x0000}}, // P
{-1, {0x0FC0,0x1FE0,0x1020,0x1C20,0x7820,0x7FE0,0x4FC0,0x0000}}, // Q
{ 0, {0x4040,0x7FC0,0x7FC0,0x0440,0x0C40,0x7FC0,0x7380,0x0000}}, // R
{ 0, {0x3180,0x73C0,0x4640,0x4440,0x4C40,0x79C0,0x3180,0x0000}}, // S
{ 0, {0x0000,0x01C0,0x40C0,0x7FC0,0x7FC0,0x40C0,0x01C0,0x0000}}, // T
{ 0, {0x3FC0,0x7FC0,0x4000,0x4000,0x4000,0x7FC0,0x3FC0,0x0000}}, // U
{ 0, {0x0FC0,0x1FC0,0x3000,0x6000,0x3000,0x1FC0,0x0FC0,0x0000}}, // V
{ 0, {0x1FC0,0x7FC0,0x7000,0x3C00,0x7000,0x7FC0,0x1FC0,0x0000}}, // W
{ 0, {0x60C0,0x71C0,0x1F00,0x0E00,0x1F00,0x71C0,0x60C0,0x0000}}, // X
{ 0, {0x0000,0x03C0,0x47C0,0x7C00,0x7C00,0x47C0,0x03C0,0x0000}}, // Y
{ 0, {0x71C0,0x78C0,0x4C40,0x4640,0x4340,0x61C0,0x70C0,0x0000}}, // Z
{ 0, {0x0000,0x0000,0x7FC0,0x7FC0,0x4040,0x4040,0x0000,0x0000}}, // [
{ 0, {0x01C0,0x0380,0x0700,0x0E00,0x1C00,0x3800,0x7000,0x0000}}, //'\'
{ 0, {0x0000,0x0000,0x4040,0x4040,0x7FC0,0x7FC0,0x0000,0x0000}}, // ]
{ 7, {0x4000,0x6000,0x3000,0x1800,0x3000,0x6000,0x4000,0x0000}}, // ^
{-2, {0x4000,0x4000,0x4000,0x4000,0x4000,0x4000,0x4000,0x4000}}, // _
{ 8, {0x0000,0x0000,0x3000,0x7000,0x4000,0x0000,0x0000,0x0000}}, // `
{ 0, {0x3000,0x7A00,0x4A00,0x4A00,0x3E00,0x7C00,0x4000,0x0000}}, // a
{ 0, {0x0040,0x7FC0,0x7FC0,0x4200,0x4600,0x7C00,0x3800,0x0000}}, // b
{ 0, {0x3C00,0x7E00,0x4200,0x4200,0x4200,0x6600,0x2400,0x0000}}, // c
{ 0, {0x3800,0x7C00,0x4600,0x4240,0x3FC0,0x7FC0,0x4000,0x0000}}, // d
{ 0, {0x3C00,0x7E00,0x4A00,0x4A00,0x4A00,0x6E00,0x2C00,0x0000}}, // e
{ 0, {0x4400,0x7F80,0x7FC0,0x4440,0x00C0,0x0180,0x0000,0x0000}}, // f
{-2, {0x2700,0x6F80,0x4880,0x4880,0x7F00,0x3F80,0x0080,0x0000}}, // g
{ 0, {0x4040,0x7FC0,0x7FC0,0x0400,0x0200,0x7E00,0x7C00,0x0000}}, // h
{ 0, {0x0000,0x0000,0x4200,0x7EC0,0x7EC0,0x4000,0x0000,0x0000}}, // i
{-2, {0x0000,0x3000,0x7000,0x4000,0x4080,0x7FB0,0x3FB0,0x0000}}, // j
{ 0, {0x4040,0x7FC0,0x7FC0,0x0800,0x1C00,0x7600,0x6200,0x0000}}, // k
{ 0, {0x0000,0x0000,0x4040,0x7FC0,0x7FC0,0x4000,0x0000,0x0000}}, // l
{ 0, {0x7E00,0x7E00,0x0600,0x3C00,0x0600,0x7E00,0x7C00,0x0000}}, // m
{ 0, {0x0200,0x7E00,0x7C00,0x0200,0x0200,0x7E00,0x7C00,0x0000}}, // n
{ 0, {0x3C00,0x7E00,0x4200,0x4200,0x4200,0x7E00,0x3C00,0x0000}}, // o
{-2, {0x4080,0x7F80,0x7F00,0x4880,0x0880,0x0F80,0x0700,0x0000}}, // p
{-2, {0x0700,0x0F80,0x0880,0x4880,0x7F00,0x7F80,0x4080,0x0000}}, // q
{ 0, {0x4200,0x7E00,0x7C00,0x4600,0x0200,0x0E00,0x0C00,0x0000}}, // r
{ 0, {0x2400,0x6E00,0x4A00,0x5A00,0x5200,0x7600,0x2400,0x0000}}, // s
{ 0, {0x0200,0x0200,0x3F80,0x7FC0,0x4200,0x6200,0x2000,0x0000}}, // t
{ 0, {0x3E00,0x7E00,0x4000,0x4000,0x3E00,0x7E00,0x4000,0x0000}}, // u
{ 0, {0x0000,0x1E00,0x3E00,0x6000,0x6000,0x3E00,0x1E00,0x0000}}, // v
{ 0, {0x3E00,0x7E00,0x6000,0x3800,0x6000,0x7E00,0x3E00,0x0000}}, // w
{ 0, {0x4200,0x6600,0x3C00,0x1800,0x3C00,0x6600,0x4200,0x0000}}, // x
{-2, {0x4780,0x4F80,0x4800,0x4800,0x6800,0x3F80,0x1F80,0x0000}}, // y
{ 0, {0x4600,0x6600,0x7200,0x5A00,0x4E00,0x6600,0x6200,0x0000}}, // z
{ 0, {0x0000,0x0400,0x0400,0x3F80,0x7BC0,0x4040,0x4040,0x0000}}, // {
{ 0, {0x0000,0x0000,0x0000,0x7BC0,0x7BC0,0x0000,0x0000,0x0000}}, // |
{ 0, {0x0000,0x4040,0x4040,0x7BC0,0x3F80,0x0400,0x0400,0x0000}}, // }
{ 7, {0x4000,0x6000,0x2000,0x6000,0x4000,0x6000,0x2000,0x0000}}, // ~
{ 1, {0x7000,0x7800,0x4C00,0x4600,0x4C00,0x7800,0x7000,0x0000}} //
}};

struct draw_t {
size_t w, row_bytes;
size_t h;
std::vector<uint8_t> pixels;
draw_t(size_t w, size_t h) : w(w), row_bytes(1 + (w+7)/8), h(h), pixels(h * row_bytes) {
for (auto & x: pixels | std::views::stride(row_bytes))
x = 0xff;
}

void set_pixel(int x, int y, bool value) {
uint8_t m = 0x80 >> (x & 7);
size_t addr = y * row_bytes + 1 + x/8;
if (value) {
pixels[addr] |= m;
} else {
pixels[addr] &= ~m;
}
}
void set_hline(int x1, int x2, int y, bool value) {
uint8_t m1 = ((0x100 >> (x1 & 7)) - 1);
uint8_t m2 = ~((0x080 >> (x2 & 7)) - 1);
size_t base_addr = y * row_bytes + 1;
if (x1/8 == x2/8) {
if (value) {
pixels[base_addr + x1/8] |= m1 & m2;
} else {
pixels[base_addr + x1/8] &= ~(m1 & m2);
}
} else {
if (value) {
pixels[base_addr + x1/8] |= m1;
pixels[base_addr + x2/8] |= m2;
} else {
pixels[base_addr + x1/8] &= ~m1;
pixels[base_addr + x2/8] &= ~m2;
}
for (int addr : std::views::iota(base_addr + x1/8 + 1, base_addr + x2/8))
pixels[addr] = -!!value;
}
}
void set_slope(int x1, int x2, int y1, int y2, float thick, bool value) {
float slope = 1.0 * (x2 - x1) / std::abs(y2 - y1);
float tan = 0.5 / std::sqrt(1. + slope*slope);
int x_low = std::min<int>(std::max<int>(0, x1 - tan*thick), w-1);
int x_high = std::min<int>(std::max<int>(0, x2 + tan*thick), w-1);
for (int d : std::views::iota(0, std::abs(y2-y1)+1)) {
int l = (d + 0.5) * slope - tan * thick;
int r = (d + 1.5) * slope + tan * thick;
int y = y2 > y1 ? y1 + d : y1 - d;
l = std::max(x_low, std::min(l + x1, x_high));
r = std::max(l, std::min(r - 1 + x1, x_high));
set_hline(l, r, y, value);
}
}

void write_png(const std::string& filename) {
std::ofstream out(filename, std::ios::binary);
uint32_t width = w;
uint32_t height = h;
std::vector<uint8_t> png_data {137, 80, 78, 71, 13, 10, 26, 10};
auto png = std::back_inserter(png_data);
std::vector<uint8_t> ihdr_data {'I', 'H', 'D', 'R'};
auto ihdr = std::back_inserter(ihdr_data);
insert_be(ihdr, width);
insert_be(ihdr, height);
std::ranges::copy(std::array<uint8_t,5>{1,0,0,0,0}, ihdr);
insert_be(png, uint32_t{13});
std::ranges::copy(ihdr_data, png);
insert_be(png, crc(ihdr_data));
std::vector<uint8_t> compressed_data { deflate(pixels) };
insert_be(std::back_inserter(compressed_data), adler32(pixels));
constexpr std::array<uint8_t,6> idat_type {'I', 'D', 'A', 'T', 0x78, 0x01};
insert_be(png, static_cast<uint32_t>(compressed_data.size() + 2));
std::ranges::copy(idat_type, png);
std::ranges::copy(compressed_data, png);
insert_be(png, crc(std::views::concat(idat_type, compressed_data)));
constexpr std::array<uint8_t, 4> iend_type {'I', 'E', 'N', 'D'};
insert_be(png, uint32_t{0});
std::ranges::copy(iend_type, png);
insert_be(png, crc(iend_type));
out.write(reinterpret_cast<const char*>(png_data.data()), png_data.size());
}

struct kwargs_t {int fill = 0; int outline=0; int width = 1; int m = 0; int os = 0;};
void line(std::tuple<int,int,int,int> coord, kwargs_t kwargs) {
auto [x1,y1,x2,y2] = coord;
x1 = std::min<int>(w-1, std::max<int>(x1, 0));
x2 = std::min<int>(w-1, std::max<int>(x2, 0));
y1 = std::min<int>(h-1, std::max<int>(y1, 0));
y2 = std::min<int>(h-1, std::max<int>(y2, 0));
if (y1 == y2)
for (int i : std::views::iota(0,kwargs.width))
set_hline(std::min(x1, x2), std::max(x1, x2), y1 + (i+1)/2 * ((i&1)*2-1), !kwargs.fill);
else {
if (x1 > x2) {
std::tie(x1, x2) = std::tuple{x2, x1};
std::tie(y1, y2) = std::tuple{y2, y1};
}
set_slope(x1, x2, y1, y2, kwargs.width, !kwargs.fill);
}
}
void ellipse(std::tuple<int,int,int,int> coord, kwargs_t kwargs) {
auto [x1,y1,x2,y2] = coord;
std::tie(x1, x2) = std::tuple{std::min(x1, x2), std::max(x1, x2)};
std::tie(y1, y2) = std::tuple{std::min(y1, y2), std::max(y1, y2)};
int dx = (x2 - x1) + kwargs.width;
float ox = (x1 + x2) * 0.5;
float oy = (y1 + y2) * 0.5;
float ro = dx * 0.5, ri = dx * 0.5 - kwargs.width;
for (int y : std::views::iota(std::max(0,y1-kwargs.width), std::min<int>(h-1,y2+kwargs.width+1))) {
for (int x : std::views::iota(std::max(0,x1-kwargs.width), std::min<int>(w-1,x2+kwargs.width+1))) {
float rsq = (x-ox)*(x-ox) + (y-oy)*(y-oy);
if (rsq <= ri*ri)
set_pixel(x, y, !kwargs.fill);
else if (rsq <= ro*ro)
set_pixel(x, y, !kwargs.outline);
}
}
}
void text(std::tuple<int,int> coord, std::string text, std::string anchor) {
auto [l,t] = coord;
int length = text.size() * 8;
auto [r,b] = std::tuple{l + length, t + 15};
if (anchor[0] == 'r') {
l -= length; r -= length;
}
for (int x : std::views::iota(std::max(l,0),std::min<int>(r,w))) {
size_t nchr = (x - l) / 8;
size_t ncol = (x - l) & 7;
int chr = text[nchr];
if (chr > 0x7f or chr < 0x20) chr = '?';
const auto& [offset, dots] = font[chr - 0x20];

for (int y : std::views::iota(std::max(t-offset,0), std::min<int>(b-offset,h))) {
if ((dots[ncol] >> (y - t + offset)) & 1)
set_pixel(x, y, 1);
}
}

}
void mos(int x, int y) {
line({x+15,y+25,x+20,y+25},{.fill=1,.width=2});
line({x+15,y+10,x+15,y+40},{.fill=0,.width=2});
line({x+20,y+10,x+20,y+40},{.fill=0,.width=2});
line({x+25,y ,x+25,y+10},{.fill=0,.width=2});
line({x+25,y+40,x+25,y+50},{.fill=0,.width=2});
line({x+20,y+10,x+25,y+10},{.fill=0,.width=2});
line({x+20,y+40,x+25,y+40},{.fill=0,.width=2});
}
void nmos(std::tuple<int,int> coord) {
auto [x,y] = coord;
line({x,y+25,x+15,y+25},{.fill=0,.width=2});
mos(x, y);
}
void pmos(std::tuple<int,int> coord) {
auto [x,y] = coord;
line({x,y+25,x+11,y+25},{.fill=0,.width=2});
ellipse({x+11,y+22,x+15,y+27},{.fill=1,.outline=0,.width=2});
mos(x, y);
}
void inv(std::tuple<int,int> coord) {
auto [x,y] = coord;
ellipse({x+35-2,y-2,x+35+2,y+2},{.fill=0,.outline=0,.width=2});
line({x+35,y, x+35,y+10}, {.fill=0,.width=2});
pmos({x+10,y+10});
line({x+35,y+60, x+35,y+80}, {.fill=0,.width=2});
line({x+35,y+70, x+60,y+70}, {.fill=0,.width=2});
nmos({x+10,y+80});
line({x+35,y+130,x+35,y+140},{.fill=0,.width=2});
line({x+30,y+140,x+40,y+140},{.fill=0,.width=1});
line({x+30,y+140,x+35,y+145},{.fill=0,.width=1});
line({x+40,y+140,x+35,y+145},{.fill=0,.width=1});
line({x, y+35, x, y+105},{.fill=0,.width=2});
line({x, y+35, x+10,y+35}, {.fill=0,.width=2});
line({x, y+105,x+10,y+105},{.fill=0,.width=2});
}
void nand(std::tuple<int,int> coord) {
auto [x,y] = coord;
ellipse({x+35-2, y-2,x+35+2, y+2},{.fill=0,.outline=0,.width=2});
ellipse({x+115-2,y-2,x+115+2,y+2},{.fill=0,.outline=0,.width=2});
line({x+35, y, x+35, y+10}, {.fill=0,.width=2});
pmos({x+10,y+10});
line({x+115,y, x+115,y+10}, {.fill=0,.width=2});
pmos({x+90,y+10});
line({x+35, y+60, x+35, y+80}, {.fill=0,.width=2});
line({x+115,y+60, x+115,y+70}, {.fill=0,.width=2});
line({x+35, y+70, x+140,y+70}, {.fill=0,.width=2});
nmos({x+10,y+80});
nmos({x+90,y+80});
line({x+35, y+130,x+115,y+130},{.fill=0,.width=2});
line({x+115,y+80, x+130,y+80}, {.fill=0,.width=2});
line({x+130,y+80, x+130,y+140},{.fill=0,.width=2});
line({x+125,y+140,x+135,y+140},{.fill=0,.width=1});
line({x+125,y+140,x+130,y+145},{.fill=0,.width=1});
line({x+135,y+140,x+130,y+145},{.fill=0,.width=1});
line({x, y+35, x, y+105},{.fill=0,.width=2});
line({x, y+35, x+10, y+35}, {.fill=0,.width=2});
line({x, y+105,x+10, y+105},{.fill=0,.width=2});
line({x+80, y+35, x+80, y+105},{.fill=0,.width=2});
line({x+80, y+35, x+90, y+35}, {.fill=0,.width=2});
line({x+80, y+105,x+90, y+105},{.fill=0,.width=2});
}
};

const std::unordered_map<int, std::tuple<int, int>> CELL_SIZE = {
{NOT, {12, 28}},
{NAND, {28, 28}}
};
const std::unordered_map<int, std::vector<std::tuple<int, int>>> PORT_OFFSET = {
{NOT, {{0,14},{12,14}}},
{NAND, {{0,14},{16,14},{28,14}}}
};

constexpr auto operator+(const std::tuple<int,int>& lhs, const std::tuple<int,int>& rhs) {
return std::tuple{std::get<0>(lhs)+std::get<0>(rhs), std::get<1>(lhs)+std::get<1>(rhs)};
}

std::tuple<long,
std::vector<std::vector<std::tuple<int,int>>>,
std::vector<std::tuple<int,int>>>
route(const std::set<int>& inputs,
const std::set<int>& outputs,
const std::unordered_map<int, std::pair<int, std::vector<int>>>& gates,
const std::unordered_map<int, std::tuple<int,int>>& placement,
size_t canvas_w, size_t canvas_h) {
std::vector<std::tuple<int,int,int,int,int,int>> todo;
std::vector<std::vector<std::tuple<int,int>>> paths;
std::vector<std::tuple<int,int>> solders;
std::vector<int> drive(canvas_w*canvas_h);
std::vector<int> mask(canvas_w*canvas_h);
auto at = [canvas_h](size_t x, size_t y){ return x*canvas_h + y; };
long cost = 0;
for (const auto& [sink, values] : gates) {
const auto& [sinkgatetype, inports] = values;
for (const auto& [i, source] : std::views::enumerate(inports)) {
if (inputs.contains(source) or outputs.contains(source)) continue;
auto sourcegatetype = gates.at(source).first;
auto [x1, y1] = placement.at(source) + PORT_OFFSET.at(sourcegatetype).back();
auto [x2, y2] = placement.at(sink) + PORT_OFFSET.at(sinkgatetype)[i];
auto semipm = std::abs(x1 - (int)canvas_w/2) + std::abs(y1 - (int)canvas_h/2);

todo.emplace_back(semipm, x1, y1, x2, y2, source);
}
static const std::unordered_map<int, void(*)(int,int,size_t,decltype(mask)&)> mask_func {
{NOT, +[](int X, int Y, size_t canvas_h, decltype(mask)& mask){
for (size_t x = X+1; x <= X+11; x++) for (size_t y = Y-2; y <= Y+30; y++) {
mask[x*canvas_h + y] = 3;
}
} },
{NAND, +[](int X, int Y, size_t canvas_h, decltype(mask)& mask){
for (size_t x = X+1; x <= X+15; x++) for (size_t y = Y-2; y <= Y+30; y++) {
mask[x*canvas_h + y] = 3;
}
for (size_t x = X+15; x <= X+17; x++) for (size_t y = Y+8; y <= Y+20; y++) {
mask[x*canvas_h + y] = 3;
}
for (size_t x = X+17; x <= X+27; x++) for (size_t y = Y-2; y <= Y+30; y++) {
mask[x*canvas_h + y] = 3;
}
} },
};
auto [X, Y] = placement.at(sink);
(*mask_func.at(sinkgatetype))(X, Y, canvas_h, mask);
}
for (size_t x = 0; x < canvas_w; x++) for (size_t y = 5; y < canvas_h; y+=40) {
mask[at(x,y)] = 1;
}
std::sort(todo.begin(), todo.end());
int done = 0;
for (auto [_, x1, y1, x2, y2, driver_id] : todo) {
std::vector<int> mark(canvas_w*canvas_h);
bool marked = true, reached = false;
int i = 0;
size_t x, y;
for (y = y2-7; y <= y2+7; y++)
mark[at(x2,y)] = 1;
do { i++;
[&](){
marked = false;
for (x = 0; x < canvas_w; x++) for (y = 0; y < canvas_h; y++) {
if (mark[at(x,y)] == i) {
marked = true;
if (drive[at(x,y)] == driver_id or (x == x1 and y == y1)) {
reached = true; return;
}
if (x+1<canvas_w and (mask[at(x,y)] & 1) == 0 and !mark[at(x+1,y)])
mark[at(x+1,y)] = i+1;
if (x>=1 and (mask[at(x-1,y)] & 1) == 0 and !mark[at(x-1,y)])
mark[at(x-1,y)] = i+1;
if (y+1<canvas_h and (mask[at(x,y)] & 2) == 0 and !mark[at(x,y+1)])
mark[at(x,y+1)] = i+1;
if (y>=1 and (mask[at(x,y-1)] & 2) == 0 and !mark[at(x,y-1)])
mark[at(x,y-1)] = i+1;
}
}
}();
} while (marked and !reached);
if (!marked) {
std::cout << "\nerror: routing failed" << std::endl;
return { -1, paths, solders };
}
if (x != x1 or y != y1) solders.emplace_back(x,y);
decltype(paths)::value_type path;
int k = 1;
while (mark[at(x,y)] > 1) {
path.emplace_back(x,y);
drive[at(x,y)] = driver_id;
cost += 1;
if (k == 0 and x>=1 and mark[at(x-1,y)] > 0 and mark[at(x-1,y)] < mark[at(x,y)]) {
mask[at(x-1,y)] |= 1;
x--;
} else if (k == 1 and x+1 < canvas_w and mark[at(x+1,y)] > 0 and mark[at(x+1,y)] < mark[at(x,y)]) {
mask[at(x,y)] |= 1;
x++;
} else if (k == 2 and y>=1 and mark[at(x,y-1)] > 0 and mark[at(x,y-1)] < mark[at(x,y)]) {
mask[at(x,y-1)] |= 2;
y--;
} else if (k == 3 and y+1 < canvas_h and mark[at(x,y+1)] > 0 and mark[at(x,y+1)] < mark[at(x,y)]) {
mask[at(x,y)] |= 2;
y++;
} else if (x>=1 and mark[at(x-1,y)] > 0 and mark[at(x-1,y)] < mark[at(x,y)]) {
mask[at(x-1,y)] |= 1;
mask[at(x,y)] |= 2;
x--; k = 0;
} else if (x+1 < canvas_w and mark[at(x+1,y)] > 0 and mark[at(x+1,y)] < mark[at(x,y)]) {
mask[at(x,y)] |= 3;
x++; k = 1;
} else if (y>=1 and mark[at(x,y-1)] > 0 and mark[at(x,y-1)] < mark[at(x,y)]) {
mask[at(x,y)] |= 1;
mask[at(x,y-1)] |= 2;
y--; k = 2;
} else if (y+1 < canvas_h and mark[at(x,y+1)] > 0 and mark[at(x,y+1)] < mark[at(x,y)]) {
mask[at(x,y)] |= 3;
y++; k = 3;
} else {
std::cout << "\nerror: routing failed" << std::endl;
return { -1, paths, solders };
}
}
path.emplace_back(x,y);
drive[at(x,y)] = driver_id;
paths.push_back(std::move(path));
solders.emplace_back(x,y);
for (y = y2-7; y <= y2+7; y++)
drive[at(x2,y)] = driver_id;
done++;
std::cout << "[" << std::string(done * 70 / todo.size(), '=')
<< '>' << std::string(70 - done * 70 / todo.size(), '.')
<< std::format("] {:3d}%\r", done * 100 / todo.size());
}
std::cout << std::endl;
return { cost, paths, solders };
}

std::tuple<size_t,size_t> snap_to_microgrid(
std::unordered_map<int, std::tuple<int,int>>& placement,
const std::unordered_map<int, std::pair<int, std::vector<int>>>& gates,
double slackness) {
int min_x = std::numeric_limits<int>::max();
int min_y = std::numeric_limits<int>::max();
int max_x = 0;
int max_y = 0;
for (auto& [gate, coord] : placement) {
auto& [x, y] = coord;
x = std::ceil(x * 2 * slackness); y = y * 40;
min_x = std::min(x, min_x);
min_y = std::min(y, min_y);
}
for (auto& [gate, coord] : placement) {
auto& [x, y] = coord;
x = x - min_x + 10; y = y - min_y + 5;
auto [gate_size_x, gate_size_y] = CELL_SIZE.at(std::get<0>(gates.at(gate)));
max_x = std::max(max_x, x + gate_size_x);
max_y = std::max(max_y, y + gate_size_y);
}
return std::tuple<size_t,size_t>{ max_x + 10, max_y + 5 };
}

double fast_route_estimate(
const std::unordered_map<int, std::tuple<int,int>>& placement,
const std::set<int>& inputs, const std::set<int>& outputs,
const std::unordered_map<int, std::pair<int, std::vector<int>>>& gates) {
std::vector<std::tuple<int,int,int>> connections;
auto [l, t] = placement.cbegin()->second;
auto [r, b] = placement.cbegin()->second;
for (const auto& [sink, values] : gates) {
const auto& [sinkgatetype, inports] = values;
auto [x, y] = placement.at(sink);
l = std::min(l, x); t = std::min(t, y);
r = std::max(r, x); b = std::max(b, y);
for (const auto& [i, source] : std::views::enumerate(inports)) {
if (inputs.contains(source) or outputs.contains(source)) continue;
connections.emplace_back(source, sink, i);
}
}
double cost = (r-l) * (b-t) * area_cost;
for (const auto& [source, sink, port_id] : connections) {
auto [sx, sy] = placement.at(source);
auto [dx, dy] = placement.at(sink);
auto [source_xoff, source_yoff] = PORT_OFFSET.at(std::get<0>(gates.at(source))).back();
auto [sink_xoff, sink_yoff] = PORT_OFFSET.at(std::get<0>(gates.at(sink )))[port_id];
cost += std::abs((sx+source_xoff)-(dx+sink_xoff)) + std::abs(((sy+source_yoff)-(dy+sink_yoff))*40);
cost += (sy == dy) ? 0 : cross_track_penalty;
}
return cost;
}

std::unordered_map<int, std::tuple<int,int>> place(
std::set<int> inputs, std::set<int> outputs,
const std::unordered_map<int, std::pair<int, std::vector<int>>>& gates) {
double temperature = initial_temperature;
std::unordered_map<int, std::tuple<int,int>> placement;
auto is_overlapping = [&](const decltype(placement)& placement) {
std::deque<std::map<int,int>> track;
int min_track = 0;
for (const auto& [gate, coord] : placement) {
const auto& [x, y] = coord;
const auto& gatetype = std::get<0>(gates.at(gate));
auto [width, _] = CELL_SIZE.at(gatetype);
int l = x, r = x + width/2;

if (y - min_track >= (int)track.size()) {
track.append_range(std::views::repeat(std::map<int,int>{}, y - min_track - track.size() + 1));
track.back().emplace(l, r);
} else if (y - min_track < 0) {
track.prepend_range(std::views::repeat(std::map<int,int>{}, min_track - y));
min_track = y;
track.front().emplace(l, r);
} else {
auto& occupacy = track[y - min_track];
auto il = std::ranges::upper_bound(std::views::values(occupacy), l).base();
auto ir = occupacy.lower_bound(r);
if (il == ir) occupacy.emplace_hint(ir, l, r);
else return true;
}
}
return false;
};
int estimated_tracks = std::round(std::sqrt(gates.size() * initial_area_margins));
std::mt19937 rng(std::random_device{}());
for (const auto& [gate, values] : gates) {
const auto& gatetype = std::get<0>(values);
do {
int new_x = std::uniform_int_distribution<int>(0, estimated_tracks * 20)(rng);
int new_y = std::uniform_int_distribution<int>(0, estimated_tracks )(rng);
placement[gate] = {new_x, new_y};
} while (is_overlapping(placement));
}
double current_energy = fast_route_estimate(placement, inputs, outputs, gates);
auto best_placement{ placement };
auto best_energy{ current_energy };
auto choice = std::uniform_int_distribution<int>(0, placement.size()-1);
auto gauss = std::normal_distribution<double>(0, 1);
auto dice = std::uniform_real_distribution<double>(0, 1);
auto tick = std::chrono::system_clock::now();
auto tock = tick;
for (int iteration = 0; temperature > stop_temperature; iteration++) {
auto new_placement{ placement };
do {
std::array<int,2> ridx { choice(rng), choice(rng) };
std::tie(ridx[0], ridx[1]) = std::tuple<int,int>{ std::min(ridx[0], ridx[1]), std::abs(ridx[0]-ridx[1]) };
auto iter = new_placement.begin();
for (auto i : ridx) {
std::advance(iter, i);
std::get<0>(iter->second) += int(gauss(rng)*std::log(temperature)*20);
std::get<1>(iter->second) += int(gauss(rng)*std::log(temperature));
}
} while (is_overlapping(new_placement));
double new_energy = fast_route_estimate(new_placement, inputs, outputs, gates);
if (new_energy < current_energy or dice(rng) < std::exp((current_energy-new_energy)/temperature)) {
placement = std::move(new_placement);
current_energy = new_energy;
if (current_energy < best_energy) {
best_placement = placement;
best_energy = current_energy;
}
}
temperature *= cooling_rate;
auto now = std::chrono::system_clock::now();
if (now - tick > 50ms) {
std::cout << std::format("Temperature {:3.3f}, Energy {} \r", temperature, current_energy) << std::flush;
tick = now;
}
if (now - tock > step_timeout) break;
tock = now;
}
std::cout << std::format("Temperature {:3.3f}, Energy {} \n", temperature, best_energy);
return best_placement;
}

void draw_schematic(
const std::set<int>& inputs, const std::set<int>& outputs,
const std::unordered_map<int, std::pair<int, std::vector<int>>>& gates,
const std::vector<std::string> names,
const std::vector<std::vector<std::tuple<int,int>>>& wires,
const std::vector<std::tuple<int,int>>& solders,
const std::unordered_map<int, std::tuple<int,int>>& placement,
size_t canvas_w, size_t canvas_h) {
draw_t draw(canvas_w*5, canvas_h*5);
static const std::unordered_map<int, void(draw_t::*)(std::tuple<int,int>)> DRAWFN {
{NOT, &draw_t::inv}, {NAND, &draw_t::nand}
};
// draw stdcells
for (const auto& [name, values] : gates) {
const auto& [gatetype, _] = values;
const auto& [x, y] = placement.at(name);
(draw.*DRAWFN.at(gatetype))({x*5,y*5});
}
// draw powergrid
for (size_t y = 25; y < canvas_h*5; y+=200)
draw.line({0,y,canvas_w*5,y},{.width=2});
// draw wires
for (const auto& wire : wires) {
auto [x1, y1] = wire.front();
for (const auto& [x2, y2] : wire | std::views::drop(1)) {
draw.line({x1*5, y1*5, x2*5, y2*5},{.fill=0,.width=2});
std::tie(x1, y1) = std::tuple{x2, y2};
}
}
// draw solder joints
for (const auto& [x, y] : solders) {
draw.ellipse({x*5-2, y*5-2, x*5+2, y*5+2},{.fill=0,.outline=0,.width=1});
}
// draw port labels
for (const auto& [sink, values] : gates) {
const auto& [sinkgatetype, inports] = values;
for (const auto& [i, source] : std::views::enumerate(inports)) {
if (inputs.contains(source)) {
auto [x, y] = placement.at(sink) + PORT_OFFSET.at(sinkgatetype)[i];
y = y + 2;
draw.text({x*5-12, y*5-8}, names[source], "rt");
draw.line({x*5-10,y*5-2,x*5-6,y*5-2}, {.fill=0,.width=1});
draw.line({x*5-10,y*5-1,x*5-4,y*5-1}, {.fill=0,.width=1});
draw.line({x*5-10,y*5-0,x*5-2,y*5-0}, {.fill=0,.width=1});
draw.line({x*5-10,y*5+1,x*5-4,y*5+1}, {.fill=0,.width=1});
draw.line({x*5-10,y*5+2,x*5-6,y*5+2}, {.fill=0,.width=1});
}
}
}
for (const auto& output : outputs) {
const auto& gatetype = gates.at(output).first;
auto [x, y] = placement.at(output) + PORT_OFFSET.at(gatetype).back();
draw.text({x*5+10, y*5-8}, names[output], "lt");
draw.line({x*5+8,y*5-2,x*5+4,y*5-2}, {.fill=0,.width=1});
draw.line({x*5+8,y*5-1,x*5+2,y*5-1}, {.fill=0,.width=1});
draw.line({x*5+8,y*5-0,x*5+0,y*5-0}, {.fill=0,.width=1});
draw.line({x*5+8,y*5+1,x*5+2,y*5+1}, {.fill=0,.width=1});
draw.line({x*5+8,y*5+2,x*5+4,y*5+2}, {.fill=0,.width=1});
}
draw.write_png(schematic_filename);
}

int main(int argc, char *argv[]) {
auto [inputs, outputs, gates, names] = parse(argc, argv);
if (gates.empty()) return 0;
auto placement = place(inputs, outputs, gates);
for (double slackness = 1.0; slackness < 10.; slackness *= 1.05) {
std::cout << "slackness: " << slackness << std::endl;
decltype(placement) micro_placement(placement);
auto [canvas_w, canvas_h] = snap_to_microgrid(micro_placement, gates, slackness);
auto [cost, wires, solders] = route(inputs, outputs, gates, micro_placement, canvas_w, canvas_h);
if (cost < 0) continue;
std::cout << "total wire length: " << cost << std::endl;
draw_schematic(inputs, outputs, gates, names, wires, solders, micro_placement, canvas_w, canvas_h);
std::cout << "schematic generated." << std::endl;
return 0;
}
std::cout << "error: all routing attempts failed, no schematic generated." << std::endl;
return 0;
}

Here’s an analysis of the code by DeepSeek-R1:

This program is a tool for generating standard CMOS circuit schematics, supporting BLIF format input and PNG image output. Below is an interpretation of its core functionality and workflow:
  1. Input Parsing (parse function)

    • Functionality: Parses the BLIF file to extract circuit input/output ports and logic gate information.
    • Key Data Structures:
      • inputs/outputs: Store input/output port IDs.
      • gates: A hash table storing logic gates, with gate IDs as keys and gate types (e.g., NOT, NAND) and their connected ports as values.
      • names: Maps gate type names to unique IDs, supporting dynamic expansion.
    • Processing Logic:
      • Parses .inputs, .outputs, and .subckt sections line by line to extract port and subcircuit information.
      • Uses the parse_name function to convert string-type gate names to integer IDs for subsequent processing.
  2. Placement Optimization (place function)

    • Algorithm: Uses simulated annealing for placement optimization, aiming to minimize routing cost and area.
    • Key Parameters:
      • initial_temperature: Controls the probability of accepting worse solutions early in the algorithm.
      • cooling_rate: Gradually reduces temperature to converge to an optimal solution.
      • area_cost and cross_track_penalty: Penalize area usage and cross-track routing, respectively.
    • Workflow:
      • Initial Placement: Randomly generates gate positions, ensuring no overlaps.
      • Iterative Optimization: Adjusts gate positions, calculates new layout costs, and accepts or rejects solutions based on the Metropolis criterion.
      • Termination: Stops when temperature drops below a threshold or after a timeout.
  3. Routing (route function)

    • Objective: Generates paths for connections between logic gates while avoiding conflicts.
    • Key Steps:
      • Obstacle Marking: Uses the mask array to mark non-routable areas (e.g., space around gates).
      • Path Search: Employs BFS to search backward from target ports to source ports, recording paths.
      • Cost Calculation: Accumulates path length and cross-track penalties.
    • Output:
      • paths: Stores coordinates of all connection paths.
      • solders: Records solder point positions.
  4. Drawing (draw_t struct)

    • Functionality: Renders the placement and routing results as a PNG image.
    • Core Methods:
      • Basic Shapes: line, ellipse for drawing lines and solder points.
      • Logic Gate Drawing: Functions like nmos, pmos, inv, nand implement CMOS gate rendering.
      • Text Labels: The text function uses predefined font bitmaps to label port names.
    • PNG Generation: Manually constructs the PNG file format (including IHDR, IDAT, IEND chunks), combines DEFLATE compression and CRC checks, and outputs the final image.
  5. Main Workflow (main function)

    • Input Parsing: Reads the BLIF file and extracts circuit information.
    • Placement Optimization: Calls place to generate the initial layout.
    • Grid Alignment: Uses snap_to_microgrid to align layout coordinates to a micro-grid.
    • Multiple Routing Attempts: Incrementally increases the slackness parameter and calls route until a routing attempt succeeds.
    • Image Generation: After successful routing, calls draw_schematic to render the schematic and save it as a PNG.

Key Technologies and Challenges

  • Simulated Annealing: Balances global search and local optimization to avoid local minima.
  • Pathfinding Algorithm: Efficiently finds feasible paths in dynamic obstacle environments.
  • Low-Level Drawing: Directly manipulates pixel data and generates PNG-compliant files, handling compression and checksum details.
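The "path search" step noted above is essentially the classic Lee maze-routing algorithm. A minimal self-contained sketch of the same idea — a BFS wavefront followed by backtracking along decreasing distance labels — is shown below, on a plain boolean obstacle grid; the grid layout and helper names are ours for illustration, and the real route() additionally tracks per-direction mask bits and a drive[] net-ownership map.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Lee-style maze router sketch: BFS wavefront expansion from the source,
// then backtracking along strictly decreasing distance labels.
std::vector<std::pair<int,int>> lee_route(
        int W, int H, const std::vector<char>& blocked,
        std::pair<int,int> src, std::pair<int,int> dst) {
    auto at = [&](int x, int y) { return y * W + x; };
    std::vector<int> dist(W * H, -1);
    std::queue<std::pair<int,int>> q;
    dist[at(src.first, src.second)] = 0;
    q.push(src);
    const int dx[4] = {1,-1,0,0}, dy[4] = {0,0,1,-1};
    while (!q.empty()) {                          // expand the wavefront
        auto [x, y] = q.front(); q.pop();
        if (x == dst.first && y == dst.second) break;
        for (int d = 0; d < 4; d++) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            if (blocked[at(nx, ny)] || dist[at(nx, ny)] >= 0) continue;
            dist[at(nx, ny)] = dist[at(x, y)] + 1;
            q.push({nx, ny});
        }
    }
    std::vector<std::pair<int,int>> path;
    if (dist[at(dst.first, dst.second)] < 0) return path;  // unreachable
    auto [x, y] = dst;
    path.push_back({x, y});
    while (dist[at(x, y)] > 0) {                  // walk downhill back to src
        for (int d = 0; d < 4; d++) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            if (dist[at(nx, ny)] == dist[at(x, y)] - 1) { x = nx; y = ny; break; }
        }
        path.push_back({x, y});
    }
    return path;  // dst -> src, inclusive of both endpoints
}
```

Because BFS labels cells in breadth order, the backtracked path is always a shortest one; the tool's version additionally prefers to keep going straight (the k variable) to reduce corners.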

Summary

This program automates the generation of schematics from logical descriptions by parsing circuit definitions, optimizing placement, performing automatic routing, and rendering. Its core strength lies in combining algorithmic optimization (simulated annealing) with low-level graphics processing, making it suitable for rapid visualization of small-to-medium-scale CMOS circuits.
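The annealing loop in place() follows the textbook Metropolis scheme: always accept improvements, and accept a worse solution with probability exp(-ΔE/T). Stripped of placement specifics, the rule can be demonstrated on a toy one-dimensional problem; the function, constants, and seed below are illustrative, not the tool's actual parameters.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Toy simulated annealing: minimize f(x) = (x-7)^2 over the integers.
// Worse moves are accepted with probability exp(-dE/T), so the search can
// move uphill while hot and freezes as the temperature decays.
int anneal_toy(unsigned seed) {
    auto f = [](int x) { return (x - 7) * (x - 7); };
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> step(-3, 3);
    std::uniform_real_distribution<double> dice(0.0, 1.0);
    int x = 100, best = x;
    for (double T = 50.0; T > 0.01; T *= 0.999) {     // geometric cooling
        int nx = x + step(rng);
        double dE = f(nx) - f(x);
        if (dE < 0 || dice(rng) < std::exp(-dE / T))  // Metropolis criterion
            x = nx;
        if (f(x) < f(best)) best = x;                 // track best-so-far
    }
    return best;
}
```

The best-so-far copy mirrors best_placement/best_energy in place(): the current state may wander uphill, but the returned solution never gets worse.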

Usage

The input is a netlist in BLIF format. You can use the open-source tool Yosys to synthesize it from Verilog.

For example, I first asked DeepSeek to write a 4-bit adder as an example:

adder4.v
module adder4 (  
input [3:0] a, // 4-bit input a
input [3:0] b, // 4-bit input b
input cin, // Carry input
output [3:0] sum, // 4-bit sum output
output cout // Carry output
);

// Implement the adder using continuous assignment
assign {cout, sum} = a + b + cin;

endmodule

Next, create a standard cell library. Currently, the code only supports NOT and NAND gates, so we need to instruct Yosys to use only these two cell types.

nandcell.lib
library(demo) {  
cell(NOT) {
area: 3;
pin(A) { direction: input; }
pin(Y) { direction: output;
function: "A'"; }
}
cell(NAND) {
area: 4;
pin(A) { direction: input; }
pin(B) { direction: input; }
pin(Y) { direction: output;
function: "(A*B)'"; }
}
}

Launch Yosys and run the following commands:

read_verilog adder4.v  
hierarchy -check
abc -liberty nandcell.lib
synth
splitnets -ports
opt
write_blif -top adder4

The last step outputs the netlist in BLIF format. Copy and save it to a file named adder4.blif.

adder4.blif
# Generated by Yosys 0.51+104 (git sha1 c08f72b80, g++ 13.3.0-6ubuntu2~24.04 -fPIC -O3)  

.model adder4
.inputs a[0] a[1] a[2] a[3] b[0] b[1] b[2] b[3] cin
.outputs sum[0] sum[1] sum[2] sum[3] cout
.names $false
.names $true
1
.names $undef
.subckt NOT A=a[3] Y=$abc$170$new_n15
.subckt NOT A=b[3] Y=$abc$170$new_n16
.subckt NOT A=a[2] Y=$abc$170$new_n17
.subckt NOT A=b[2] Y=$abc$170$new_n18
.subckt NOT A=a[1] Y=$abc$170$new_n19
.subckt NOT A=b[1] Y=$abc$170$new_n20
.subckt NOT A=a[0] Y=$abc$170$new_n21
.subckt NOT A=b[0] Y=$abc$170$new_n22
.subckt NOT A=cin Y=$abc$170$new_n23
.subckt NAND A=a[3] B=b[3] Y=$abc$170$new_n24
.subckt NAND A=a[2] B=b[2] Y=$abc$170$new_n25
.subckt NAND A=$abc$170$new_n17 B=$abc$170$new_n18 Y=$abc$170$new_n26
.subckt NAND A=$abc$170$new_n25 B=$abc$170$new_n26 Y=$abc$170$new_n27
.subckt NOT A=$abc$170$new_n27 Y=$abc$170$new_n28
.subckt NAND A=a[1] B=b[1] Y=$abc$170$new_n29
.subckt NAND A=a[0] B=b[0] Y=$abc$170$new_n30
.subckt NAND A=$abc$170$new_n21 B=$abc$170$new_n22 Y=$abc$170$new_n31
.subckt NAND A=$abc$170$new_n30 B=$abc$170$new_n31 Y=$abc$170$new_n32
.subckt NOT A=$abc$170$new_n32 Y=$abc$170$new_n33
.subckt NAND A=cin B=$abc$170$new_n33 Y=$abc$170$new_n34
.subckt NAND A=$abc$170$new_n30 B=$abc$170$new_n34 Y=$abc$170$new_n35
.subckt NOT A=$abc$170$new_n35 Y=$abc$170$new_n36
.subckt NAND A=$abc$170$new_n19 B=$abc$170$new_n20 Y=$abc$170$new_n37
.subckt NAND A=$abc$170$new_n29 B=$abc$170$new_n37 Y=$abc$170$new_n38
.subckt NOT A=$abc$170$new_n38 Y=$abc$170$new_n39
.subckt NAND A=$abc$170$new_n35 B=$abc$170$new_n39 Y=$abc$170$new_n40
.subckt NAND A=$abc$170$new_n29 B=$abc$170$new_n40 Y=$abc$170$new_n41
.subckt NOT A=$abc$170$new_n41 Y=$abc$170$new_n42
.subckt NAND A=$abc$170$new_n28 B=$abc$170$new_n41 Y=$abc$170$new_n43
.subckt NAND A=$abc$170$new_n25 B=$abc$170$new_n43 Y=$abc$170$new_n44
.subckt NOT A=$abc$170$new_n44 Y=$abc$170$new_n45
.subckt NAND A=$abc$170$new_n15 B=$abc$170$new_n16 Y=$abc$170$new_n46
.subckt NAND A=$abc$170$new_n24 B=$abc$170$new_n46 Y=$abc$170$new_n47
.subckt NOT A=$abc$170$new_n47 Y=$abc$170$new_n48
.subckt NAND A=$abc$170$new_n44 B=$abc$170$new_n48 Y=$abc$170$new_n49
.subckt NAND A=$abc$170$new_n24 B=$abc$170$new_n49 Y=cout
.subckt NAND A=$abc$170$new_n23 B=$abc$170$new_n32 Y=$abc$170$new_n51
.subckt NAND A=$abc$170$new_n34 B=$abc$170$new_n51 Y=$abc$170$new_n52
.subckt NOT A=$abc$170$new_n52 Y=sum[0]
.subckt NAND A=$abc$170$new_n36 B=$abc$170$new_n38 Y=$abc$170$new_n54
.subckt NAND A=$abc$170$new_n40 B=$abc$170$new_n54 Y=$abc$170$new_n55
.subckt NOT A=$abc$170$new_n55 Y=sum[1]
.subckt NAND A=$abc$170$new_n27 B=$abc$170$new_n42 Y=$abc$170$new_n57
.subckt NAND A=$abc$170$new_n43 B=$abc$170$new_n57 Y=$abc$170$new_n58
.subckt NOT A=$abc$170$new_n58 Y=sum[2]
.subckt NAND A=$abc$170$new_n45 B=$abc$170$new_n47 Y=$abc$170$new_n60
.subckt NAND A=$abc$170$new_n49 B=$abc$170$new_n60 Y=$abc$170$new_n61
.subckt NOT A=$abc$170$new_n61 Y=sum[3]
.end

Compile cirschem.cpp: g++ cirschem.cpp -std=c++26 -O3

Run: ./a.out adder4.blif

After waiting about half a minute for placement and routing to complete, the output will be saved as schematic.png.

Cambricon-C: Efficient 4-Bit Matrix Unit via Primitivization

(ZH-EN Translation by DeepSeek-V3-0324)

With the rise of large language models, 4-bit quantized models have become the preferred choice for deployment due to computational and storage budget constraints. However, we observe that traditional computing architectures suffer from significant computational redundancy when handling low-precision matrix multiplication.
We propose the Cambricon-C architecture, which aims to revolutionize the implementation of 4-bit matrix computation through an innovative Primitivized Matrix Multiplication (PMM) algorithm.

An in-depth analysis of 4-bit matrix multiplication reveals astonishing redundancy: in the 4-bit quantized version of Llama2-7B, computing the activation value of a single output neuron requires 11,008 4-bit multiplications followed by accumulation. Since 4-bit data has at most 16 possible values, the pigeonhole principle dictates that over 97% of these multiplications must be repeated. Traditional matrix processors based on MAC units redundantly compute these identical products, resulting in enormous energy waste.

The PMM algorithm proposes: By leveraging the inverse distributive property of multiply-accumulate operations, we first count the occurrences of each multiplication and then perform a unified weighted summation, significantly reducing computational intensity.

  1. Precomputation to Eliminate Redundant Multiplications: Precompute all possible 4-bit product combinations (256 in total) as a lookup table. During actual computation, only table lookup is required instead of repeated calculations.
  2. Reduced Addition Intensity: Transform accumulation operations into incremental updates of 256 counters, replacing traditional adder trees with unary successor operations (i.e., counting), drastically cutting addition overhead.

Based on the PMM algorithm, we designed the Cambricon-C architecture:

  • Quarter-Square Multiplication (QSM): We revisited the 19th-century quarter-square multiplication technique, reducing the number of counters from 256 to 29 (referred to as the “R29 scheme”), achieving an 8.6× improvement in area efficiency.
  • Ripple Counters: Implemented counters using self-timed D flip-flop chains, reducing dynamic power by 51% compared to SRAM-based solutions.
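Quarter-square multiplication rests on the identity a·b = ⌊(a+b)²/4⌋ − ⌊(a−b)²/4⌋, which holds for all integers because a+b and a−b share the same parity, so the two truncation errors cancel. A behavioral sketch for 4-bit operands (ours, not the paper's R29 counter datapath):

```cpp
#include <cassert>

// Quarter-square multiplication: a*b = floor((a+b)^2/4) - floor((a-b)^2/4).
// For 4-bit operands a, b in [0, 15], a+b <= 30, so a 31-entry table of
// quarter-squares replaces the full 16x16 product table.
int qsm_mul(int a, int b) {
    static int qs[31];
    static bool ready = false;
    if (!ready) {
        for (int n = 0; n < 31; n++) qs[n] = n * n / 4;  // floor division
        ready = true;
    }
    int d = a > b ? a - b : b - a;     // |a - b|, also within [0, 30]
    return qs[a + b] - qs[d];
}
```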

Experimental results show:

  • A single PE achieves 1.97× higher energy efficiency than traditional MAC units.
  • A full 32×32 array accelerator delivers a 1.25× end-to-end energy efficiency improvement for LLaMA2 inference.
  • Performance advantages scale with matrix size.

Cambricon-C pioneers a new paradigm of “primitivized computation”. By decomposing complex operations into fundamental counting operations, we redefine the efficiency limits of low-precision computing:
The future efficiency of accelerators will no longer depend on intricate MAC unit designs but rather on holistic optimization of entire matrix operations.
In the near future, we will publish more work on “primitivized computation”, continuing to drive innovation in the microarchitecture of deep learning processors. We welcome attention and collaboration from peers.

Published at MICRO 2024. [DOI] [PDF]

The Application of Skew Ternary Numbers in Gomoku AI Programs

Problem

Given a Gomoku board position, identify all threat points. A threat point is a position where placing a stone creates a winning situation for either side, such as the open end of a four-in-a-row (placing there immediately completes a five-in-a-row and wins) or either end of a living three (placing there forms a living four; unless the opponent wins on this very move, it wins on the next turn). Examples are the positions marked with “x” on the board below.

┌┬┬┬┬┬┬┬┬┬┬┬┬┬┐

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼x┼┼┤

├┼┼┼┼┼┼┼┼┼●┼┼┼┤

├┼┼┼x○○○○●┼┼┼┼┤

├┼┼┼┼┼┼┼●┼┼┼┼┼┤

├┼┼┼┼┼┼x┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

├┼┼┼┼┼┼┼┼┼┼┼┼┼┤

└┴┴┴┴┴┴┴┴┴┴┴┴┴┘

During search, an AI program must find threat points quickly in order to prune all other moves, because when a threat point exists, playing at any other position always leads to a disadvantage.

Naive Method

A simple solution is to scan the board according to the rules. I believe most people solve it this way when they first start writing such programs, and I was no exception; after all, premature optimization is the root of all evil. Eventually, though, you will find that the speed of this step is crucial.

The first solution that comes to mind is vectorization. Use a bitboard to represent the position: split the board into a black plane and a white plane of 225 bits each, which fit in a uint16_t[16] or a YMM register __m256i. AVX instructions such as _mm256_slli_epi16/_mm256_srli_epi16 (VPSLLW/VPSRLW) shift and align the entire board, and logical ANDs then detect living fours, blocked fours, living threes, and so on. This saves the overhead of enumerating board cells one by one, but because the threatening shapes are complex (a blocked four or a living three is not necessarily contiguous), the program is not simple and still takes a relatively long time.
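The shift-and-AND idea can be shown in scalar form on a single 15-bit row (a hypothetical helper, not kalscope's actual code; the AVX version applies the same shifts to all rows at once):

```cpp
#include <cassert>
#include <cstdint>

// Five-in-a-row detection on one 15-bit row via shift-and-AND:
// row & (row>>1) marks runs of >= 2, ANDing that with itself shifted by 2
// marks runs of >= 4, and one more shifted AND extends them to runs of >= 5.
bool has_five_in_row(uint32_t row) {
    uint32_t m = row & (row >> 1);   // runs of at least 2
    m &= m >> 2;                     // runs of at least 4
    return (m & (row >> 4)) != 0;    // runs of at least 5
}
```

A plain five-in-a-row check is this simple; the difficulty described above comes from the broken shapes (blocked fours, living threes), which need many such shift patterns.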

Ternary Coding

In 2014, I independently discovered a solution and applied it to the evaluation function of kalscope v2 “Dolanaar”. That program was a practice project from my undergraduate years, but it achieved competitive strength at the time.

I treat each Yang line (horizontal or vertical line) on the board separately. A line has 15 places; I enumerate all possible situations and precompute their threat points into a lookup table. However, with a bitboard representation the encoding of a line can reach 2^30 (2 bits for each of the 15 places), so the lookup table would occupy 2 GB. Note that the bitboard encoding is highly redundant: 2 bits encode only 3 situations, empty “┼” 0b00, white piece “○” 0b01, black piece “●” 0b10, while 0b11 is illegal. A compact encoding calls for ternary: 0 represents empty, 1 a white piece, 2 a black piece, and each ternary digit represents one place.

Converting to ternary from scratch involves many division and modulus operations, which is impractical. But a board position never appears out of nowhere: it is always built up move by move from an empty board, so the ternary representation can be maintained incrementally. An empty line is encoded as 0; placing a white piece at the i-th place adds 3^i to the encoding, and placing a black piece at the i-th place adds 2·3^i. On backtracking, subtract the same value from the encoding. To implement this efficiently, we only need one more lookup table holding the powers of three, power3.

constexpr int power3[16] {
1, 3, 9, 27, 81, 243, 729, 2187, 6561, 19683, 59049, 177147, 531441, 1594323, 4782969, 14348907
};
int triboard_h[15] {};
int triboard_v[15] {};
uint16_t threat_lut[14348907] { ... }; // pre-generated.

void make(int x, int y, int color) {
triboard_h[y] += power3[x] << color;
triboard_v[x] += power3[y] << color;
}
void unmake(int x, int y, int color) {
triboard_h[y] -= power3[x] << color;
triboard_v[x] -= power3[y] << color;
}
void find_threat() {
for (int y = 0; y < 15; y++) {
uint16_t threat = threat_lut[triboard_h[y]];
while (threat) {
int x = std::countr_zero(threat);

// do something to the threat (x,y).

threat &= threat - 1;
}
}
// another loop for vertical lines.
// TODO: how about diagonal / anti-diagonal lines?
}

The lookup table threat_lut tightly encodes all possible situations of a Yang line on a 15-line board, occupying only 27.4 MB of memory.

This method is designed specifically for a standard 15-line board, but at that time the Gomocup rules used 20-line boards, which made the method inapplicable; otherwise I might have entered the competition. I do not know whether anyone discovered a similar method before me, but I did notice that Rapfi, a recent Gomocup champion, implements this technique as well.

This method is not perfect. It is only for a fixed-length Yang line, and using it for the 20-line rules of Gomocup will result in an oversized LUT. In addition, the length of each Yin line (diagonal or anti-diagonal line) is different, and if you forcibly apply the lookup table for the Yang line to Yin lines, it will lead to false positives. For example, on a Yin line with a total length of only 4 places, a living three is not a threat because the board boundary will naturally block it. How should this be handled?

I did not give an answer in 2014, but left an unsolved problem.

Skew Ternary Coding

Last year, I advised doctoral student Guo Hongrui in completing Cambricon-U, which implements in-place counting in SRAM using skew binary numbers. Skew numbers are not limited to binary: the same idea works for ternary, and it solves the problem above.

Skew ternary allows a digit “3” to exist temporarily in a number without carrying over. When counting, if no digit “3” is present, increment the least significant digit by 1; if a digit “3” is present, reset it to “0” and carry into the next digit. Under these counting rules, it is easy to see that skew ternary has the following two properties:

  • There is at most one digit “3” in the number;
  • If there is a “3”, all less significant digits must be “0”.

We use “0” to represent an empty place, “1” a white piece, “2” a black piece, and “3” the board boundary. This coding naturally and compactly packs Yin lines of every length into the same lookup table: when a Yin line is shorter than 15, only the high digits are used, the boundary digit is set to “3”, and all less significant digits are “0”. The entries of the lookup table are laid out in skew ternary counting order:

Lookup Table
Decimal Skew Ternary Line Situation Threats
0 000000000000000 ├┼┼┼┼┼┼┼┼┼┼┼┼┼┤ 0b000000000000000
1 000000000000001 ├┼┼┼┼┼┼┼┼┼┼┼┼┼○ 0b000000000000000
2 000000000000002 ├┼┼┼┼┼┼┼┼┼┼┼┼┼● 0b000000000000000
3 000000000000003 ├┼┼┼┼┼┼┼┼┼┼┼┼┤▒ 0b000000000000000
4 000000000000010 ├┼┼┼┼┼┼┼┼┼┼┼┼○┤ 0b000000000000000
5 000000000000011 ├┼┼┼┼┼┼┼┼┼┼┼┼○○ 0b000000000000000
6 000000000000012 ├┼┼┼┼┼┼┼┼┼┼┼┼○● 0b000000000000000
7 000000000000013 ├┼┼┼┼┼┼┼┼┼┼┼┼○▒ 0b000000000000000
8 000000000000020 ├┼┼┼┼┼┼┼┼┼┼┼┼●┤ 0b000000000000000
9 000000000000021 ├┼┼┼┼┼┼┼┼┼┼┼┼●○ 0b000000000000000
10 000000000000022 ├┼┼┼┼┼┼┼┼┼┼┼┼●● 0b000000000000000
11 000000000000023 ├┼┼┼┼┼┼┼┼┼┼┼┼●▒ 0b000000000000000
12 000000000000030 ├┼┼┼┼┼┼┼┼┼┼┼┤▒▒ 0b000000000000000
13 000000000000100 ├┼┼┼┼┼┼┼┼┼┼┼○┼┤ 0b000000000000000
14 000000000000101 ├┼┼┼┼┼┼┼┼┼┼┼○┼○ 0b000000000000000
15 000000000000102 ├┼┼┼┼┼┼┼┼┼┼┼○┼● 0b000000000000000
16 000000000000103 ├┼┼┼┼┼┼┼┼┼┼┼○┤▒ 0b000000000000000
17 000000000000110 ├┼┼┼┼┼┼┼┼┼┼┼○○┤ 0b000000000000000
18 000000000000111 ├┼┼┼┼┼┼┼┼┼┼┼○○○ 0b000000000000000
19 000000000000112 ├┼┼┼┼┼┼┼┼┼┼┼○○● 0b000000000000000
20 000000000000113 ├┼┼┼┼┼┼┼┼┼┼┼○○▒ 0b000000000000000
21 000000000000120 ├┼┼┼┼┼┼┼┼┼┼┼○●┤ 0b000000000000000
22 000000000000121 ├┼┼┼┼┼┼┼┼┼┼┼○●○ 0b000000000000000
23 000000000000122 ├┼┼┼┼┼┼┼┼┼┼┼○●● 0b000000000000000
24 000000000000123 ├┼┼┼┼┼┼┼┼┼┼┼○●▒ 0b000000000000000
25 000000000000130 ├┼┼┼┼┼┼┼┼┼┼┼○▒▒ 0b000000000000000
26 000000000000200 ├┼┼┼┼┼┼┼┼┼┼┼●┼┤ 0b000000000000000
27 000000000000201 ├┼┼┼┼┼┼┼┼┼┼┼●┼○ 0b000000000000000
28 000000000000202 ├┼┼┼┼┼┼┼┼┼┼┼●┼● 0b000000000000000
29 000000000000203 ├┼┼┼┼┼┼┼┼┼┼┼●┤▒ 0b000000000000000
30 000000000000210 ├┼┼┼┼┼┼┼┼┼┼┼●○┤ 0b000000000000000
31 000000000000211 ├┼┼┼┼┼┼┼┼┼┼┼●○○ 0b000000000000000
32 000000000000212 ├┼┼┼┼┼┼┼┼┼┼┼●○● 0b000000000000000
33 000000000000213 ├┼┼┼┼┼┼┼┼┼┼┼●○▒ 0b000000000000000
34 000000000000220 ├┼┼┼┼┼┼┼┼┼┼┼●●┤ 0b000000000000000
35 000000000000221 ├┼┼┼┼┼┼┼┼┼┼┼●●○ 0b000000000000000
36 000000000000222 ├┼┼┼┼┼┼┼┼┼┼┼●●● 0b000000000000000
37 000000000000223 ├┼┼┼┼┼┼┼┼┼┼┼●●▒ 0b000000000000000
38 000000000000230 ├┼┼┼┼┼┼┼┼┼┼┼●▒▒ 0b000000000000000
39 000000000000300 ├┼┼┼┼┼┼┼┼┼┼┤▒▒▒ 0b000000000000000
40 000000000001000 ├┼┼┼┼┼┼┼┼┼┼○┼┼┤ 0b000000000000000
41 000000000001001 ├┼┼┼┼┼┼┼┼┼┼○┼┼○ 0b000000000000000
42 000000000001002 ├┼┼┼┼┼┼┼┼┼┼○┼┼● 0b000000000000000
43 000000000001003 ├┼┼┼┼┼┼┼┼┼┼○┼┤▒ 0b000000000000000
44 000000000001010 ├┼┼┼┼┼┼┼┼┼┼○┼○┤ 0b000000000000000
45 000000000001011 ├┼┼┼┼┼┼┼┼┼┼○┼○○ 0b000000000000000
138 000000000010110 ├┼┼┼┼┼┼┼┼┼○┼○○┤ 0b000000000001000
174 000000000011100 ├┼┼┼┼┼┼┼┼┼○○○┼┤ 0b000000000100010
175 000000000011101 ├┼┼┼┼┼┼┼┼┼○○○┼○ 0b000000000000010
176 000000000011102 ├┼┼┼┼┼┼┼┼┼○○○┼● 0b000000000100000
177 000000000011103 ├┼┼┼┼┼┼┼┼┼○○○┤▒ 0b000000000100000
21523354 222223000000000 ●●●●●▒▒▒▒▒▒▒▒▒▒ 0b000000000000000
21523355 222230000000000 ●●●●▒▒▒▒▒▒▒▒▒▒▒ 0b000000000000000
21523356 222300000000000 ●●●▒▒▒▒▒▒▒▒▒▒▒▒ 0b000000000000000
21523357 223000000000000 ●●▒▒▒▒▒▒▒▒▒▒▒▒▒ 0b000000000000000
21523358 230000000000000 ●▒▒▒▒▒▒▒▒▒▒▒▒▒▒ 0b000000000000000

The lookup table has a total of 21523359 items, tightly encoding all situations of length-15 Yang lines and Yin lines from length 1 to 15, occupying 41.1 MB of memory. The code is slightly modified, changing the ternary weight power3 to the skew ternary weight basesk3, which is the sequence OEIS A261547.

#include <bit>      // std::countr_zero (C++20)
#include <cstdint>

constexpr int basesk3[16] {
    1, 4, 13, 40, 121, 364, 1093, 3280, 9841, 29524, 88573, 265720, 797161, 2391484, 7174453, 21523360
};
int sk3board_h[15] {};
int sk3board_v[15] {};
int sk3board_d[29] {};
int sk3board_a[29] {};
uint16_t threat_lut[21523359] { ... }; // pre-generated.

// set board bounds on init: mark the first off-board digit of each
// short diagonal with a "3". Index 14 is the full-length main
// (anti-)diagonal and needs no bound.
void init() {
    for (int i = 0; i < 14; i++) {
        sk3board_d[i] = basesk3[13-i] * 3;
        sk3board_a[i] = basesk3[13-i] * 3;
    }
    for (int i = 15; i < 29; i++) {
        sk3board_d[i] = basesk3[i-15] * 3;
        sk3board_a[i] = basesk3[i-15] * 3;
    }
}

void make(int x, int y, int color) {
    sk3board_h[y] += basesk3[x] << color;
    sk3board_v[x] += basesk3[y] << color;
    sk3board_d[14-x+y] += basesk3[x<y?y:x] << color;
    sk3board_a[x+y] += basesk3[x+y<15?14-y:x] << color;
}
void unmake(int x, int y, int color) {
    sk3board_h[y] -= basesk3[x] << color;
    sk3board_v[x] -= basesk3[y] << color;
    sk3board_d[14-x+y] -= basesk3[x<y?y:x] << color;
    sk3board_a[x+y] -= basesk3[x+y<15?14-y:x] << color;
}
void find_threat() {
    for (int y = 0; y < 15; y++) {
        uint16_t threat = threat_lut[sk3board_h[y]];
        while (threat) {
            int x = std::countr_zero(threat);

            // do something to the threat (x,y).

            threat &= threat - 1; // clear the lowest set bit.
        }
    }
    // process vertical, diagonal and anti-diagonal lines likewise.
}

We have completely transformed the cumbersome threat-point enumeration problem into incrementally maintaining the board representation along four directions, followed by table lookups. A lookup table of about 41 MB fits entirely in an L3 cache, so lookups are extremely fast. In the example above, find_threat can be further vectorized, compressing the overhead of the entire operation to about a hundred clock cycles.

The same idea can be used for win checks, evaluation functions, and so on. If you plan to use an architecture similar to AlphaGo's, you might as well try changing the first layer of the neural network to 1D convolutions along the four line directions; this method can then replace the first convolution layer entirely with table lookups.

Thoughts on PIM after Cambricon-U

Quick thoughts after Cambricon-U.

RIM reflects our perspective on PIM as researchers who had not previously worked in the field. When it comes to the concept of PIM, people intuitively think first of the von Neumann bottleneck, and believe that PIM is of course the only way to overcome it. Besides amateur CS enthusiasts, there are professional scholars who write various materials in this vein.

This is not true.

In RIM’s journey, we showed why previous PIM devices could not count. Counting (the 1-ary successor function) is the primitive operation that defines Peano arithmetic, and is the very basis of arithmetic. We implemented the counting function in PIM for the first time by using a numeral representation unfamiliar to scholars in the field. But this kind of trick can only be done once; nature will not allow us a second, because it is proven that no representation can efficiently implement both addition and multiplication*.

Computation implies the flow of information, while PIM wants information to stay in place. Contradictory. Combined with Yao’s conclusion, we may never be able to efficiently complete all the basic operations in place in storage. If I had to predict, I would say PIM will never really have anything to do with the von Neumann bottleneck.

* Andrew C. Yao. 1981. The entropic limitations on VLSI computations (Extended Abstract). STOC’81. [DOI]

Cambricon-U: A Systolic Random Increment Memory

At the MICRO 2023 conference, we presented the invention of a new type of processing-in-memory device, “Random Increment Memory (RIM)”, and applied it to a stochastic computing (unary computing) architecture.

In academia, PIM is currently a hot research field, and researchers often place great hopes on it breaking through the von Neumann bottleneck with a subversive architecture. However, the concept of PIM has gradually narrowed to refer specifically to the use of new materials and devices, such as memristors and ReRAM, for matrix multiplication. By 2023, from the perspective of outsiders like us who had not worked on PIM before, PIM had developed in a strange direction: the PIM devices developed so far can process neural networks and simulate brains, yet they remain unable to perform the most basic operations such as counting and addition. In the research on Cambricon-Q, we left a regret: we designed NDPO to complete weight updates on the near-memory side, but it could not achieve true in-memory accumulation, so a reviewer criticized that “it can only be called near-memory computing, not in-memory computing.” From this, we began to think about how to complete in-place accumulation inside the memory.

We quickly realized that addition cannot be done in place with a binary representation. Although addition is simple, there is carry propagation: even adding 1 to the value in memory may cause all bits to flip. Therefore, a counter requires a full-width adder to complete the self-increment operation (the 1-ary successor function), and all data bits in the memory must be activated for potential use.

Fortunately, the self-increment operation flips only two bits on average. We needed a numeral representation that also limits the number of bit flips in the worst case, so we introduced the skew binary number system to replace binary numbers. Skew binary was originally proposed for building data structures, such as the Brodal heap, that limit the worst-case time complexity of heap merging. That is very similar to the problem here: limiting carry propagation.

We designed RIM based on SRAM in conventional CMOS technology. We store digits (in skew binary) along the column direction, and use an additional column of SRAM cells to store the skew bit of each digit (that is, where the digit “2” is located in the skew number). The self-increment operation proceeds as follows:

  • If any digit in the current skew number is “2”, set “2” to “0” and increase the next digit by 1;
  • Otherwise, increase the least significant digit by 1.
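In a plain-array software model (my sketch, not the SRAM circuit), the two rules above read as follows; note that each increment writes at most two digits:

```cpp
// skew binary digits, least significant first; the counting rules
// guarantee that at most one digit is ever "2".
void skew_increment(int d[], int n) {
    for (int i = 0; i < n - 1; i++) {   // assumes no overflow at the top digit.
        if (d[i] == 2) {    // rule 1: clear the "2", carry one step.
            d[i] = 0;
            d[i + 1] += 1;
            return;         // exactly two digits written.
        }
    }
    d[0] += 1;              // rule 2: no "2" present, one digit written.
}

// value of a skew binary number: digit weights are 2^(i+1) - 1.
int skew_value(const int d[], int n) {
    int v = 0, w = 1;
    for (int i = 0; i < n; i++) {
        w *= 2;
        v += d[i] * (w - 1);
    }
    return v;
}
```

Counting from zero with this routine reproduces 1, 2, 3, … while never touching more than two digits, which is exactly the property that lets RIM bound the number of cell writes per cycle.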

Because the row index of the “2” differs from one skew number to another (the skew counting rules guarantee that at most one digit can be “2”), the cells to be operated on are randomly distributed in the memory array and cannot be activated through ordinary SRAM row selection. But we can use the latched skew bits to activate the corresponding cells, so that cells located in different rows can be activated in the same cycle!

Finally, we achieved a 24T RIM cell, built entirely from CMOS without any new materials. RIM can perform random self-increment on stored data: in the same cycle, each datum can self-increment or remain unchanged, on demand.

We applied RIM to stochastic computing (unary computing). A major dilemma in stochastic computing is the cost of conversion between unary and binary numbers. Converting binary numbers to unary requires a random number generator, and converting back requires a counter. Because unary numbers are long (up to thousands of bits), intermediate results computed in unary must be converted back to binary for buffering. As a result, counting operations can account for 78% of the energy consumption. We used RIM to replace the counters in uSystolic and proposed the Cambricon-U architecture, which significantly reduces the energy consumption of counting. This work solves a key problem of stochastic deep learning processors, boosting the application of these technologies.
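To make the counting cost concrete, here is a minimal software model of stochastic multiplication (my illustration; uSystolic's actual datapath differs): two probabilities are multiplied by ANDing Bernoulli bitstreams, and the `ones` counter that converts the unary result back to binary is precisely the component that RIM replaces with in-memory increments.

```cpp
#include <cstdint>
#include <random>

// stochastic multiplication: p_a * p_b ~= density of (a AND b).
double sc_multiply(double pa, double pb, int stream_len, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution a(pa), b(pb);
    int ones = 0;                    // the binary counter: the costly part.
    for (int i = 0; i < stream_len; i++)
        ones += (a(rng) && b(rng));  // one AND gate per stream bit.
    return double(ones) / stream_len;
}
```

For example, `sc_multiply(0.5, 0.5, 200000, 42)` yields a value close to 0.25, at the price of 200000 counter increments for a single product.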

Published on MICRO 2023. [DOI] [PDF]

Yongwei's ten rules on thesis writing

Note: translated with the help from ChatGLM3, minimalist revision only. The quality of translation may vary.

Spent the whole of May helping people to refine their theses.

In our educational system, no one teaches students how to write a thesis before they start writing one. This leads to many problems: students don’t know how to write, and advisors have a hard time correcting them; they revise and revise, and after a few months still get stuck at the defense. The vast majority of problems in theses are common ones, but they have to be taught one by one, which is time-consuming and labor-intensive, and the results are not very good. I have summarized some common problems for those who may need them, for self-examination and improvement.

To make things clear to students who have no writing experience, the problems I list are all very surface-level, simple, and formulaic. They can all be quantitatively analyzed by running just a few lines of script. Please note: being free of these phenomena does not mean that a thesis will pass, and exhibiting them does not necessarily mean it is not an excellent thesis. I hope to inspire readers to think for themselves about why most problematic theses look like this.
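As an illustration of how mechanical these checks are (and to keep this a Python-free site), here is a hedged C++ sketch of two of the indices discussed below – the underscore count of Law 3 and the western-character proportion of Law 4 – computed over a plain-text dump of a thesis; the function names are my own:

```cpp
#include <string>

// Law 3 index: count of underscores in the text.
int count_underscores(const std::string& text) {
    int n = 0;
    for (char c : text) n += (c == '_');
    return n;
}

// Law 4 index: proportion of western (ASCII) characters, counted
// over non-whitespace bytes, which only approximates UTF-8 code points.
double western_ratio(const std::string& text) {
    int ascii = 0, total = 0;
    for (unsigned char c : text) {
        if (c == ' ' || c == '\n' || c == '\t' || c == '\r') continue;
        total++;
        if (c < 0x80) ascii++;
    }
    return total ? double(ascii) / total : 0.0;
}
```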

Law 1: The frequency of the term “extremely” is negatively correlated with the quality of the thesis.

Terms like “extremely” can certainly be used in scientific writing in the sense of “in the limit”. But more often, words like “extremely”, “very”, and “significantly” appear in a thesis merely for emphasis. In language, these words serve to express emotion and are subjective, providing little help in objectively describing things. A thesis, however, as a carrier of science, must remain objective. When writing such subjective descriptions, students need to be especially cautious and should carefully check whether they have deviated from the basic requirement of objectivity. In general, someone who makes this mistake once tends to commit a string of similar errors, which can leave reviewers with an unforgettable impression of unprofessionalism.

Similarly, it is recommended that students check whether other forms of subjective evaluation, such as “widely applied” or “made significant contributions”, or words to that effect, appear in the thesis. These are comments that only reviewers may make, as reviewing calls for expert opinions. The author of a thesis should discuss objective laws, not advertise themselves; and unless they provide a clear and objective basis, they are in no position to pass judgment on others either.

The solution is to provide evidence. For example, “high efficiency” can be changed to “achieved 98.5% efficiency in a (particular) test”, and “widely applied” can be changed to “applied in the fields of sentiment analysis [17], dialogue systems [18], and abstract generation [19].”

Law 2: The variance of the number of fields in the references is negatively correlated with the quality of the thesis.

You are too lazy to deal with the references, so you copy BibTeX entries from DBLP, the ACM DL, or the publisher’s website and paste them together, leaving the formatting of the reference section to the mercy of a random bst file. The end result is that the quality of the references varies: some entries are missing fields, others carry many meaningless ones. Therefore, I say that counting the number of fields in each reference and calculating the variance can roughly reflect the quality of the writing. The issues in references are truly intricate, and it takes endless time to clean them up, so they reflect how much effort the author has put into the thesis, which is quite noteworthy. For the same conference, some entries give only the full conference name, some prepend the organizer, some append an abbreviation, some use the year, and some use the session number. Some already have a DOI yet still append an arXiv URL. Titles have inconsistent capitalization. Sometimes there are even a hundred different ways to write “NVIDIA”. How carefully a thesis is written, and how much time was spent polishing it, can be seen clearly from the references. During the degree defense, time is tight for the reviewers; they will probe the weakest spots first, and you cannot fool them.

The solution is to put in the effort to go through them all, which requires not being afraid of trouble and spending the time. The information provided on the publisher’s website can only be used as a reference. BibTeX is just a formatting tool; the displayed result always depends on the author’s editing.

Law 3: The count of underscores (_) in the text is negatively correlated with the quality of the thesis.

There is no such punctuation mark in Chinese, nor in English; no human language, as far as I know, has this character in its vocabulary. The only way it gets into a thesis is when the author treats the thesis as coding documentation and pastes code snippets into the text. This law is the most accurate one, with almost no exceptions.

The purpose of writing is to convey your academic views to others. To convey them efficiently, a thesis must first abstract and condense the knowledge. One should never dump a pile of raw material and expect readers to understand it on their own; otherwise, what is the point of having an author? The process of abstraction means leaving out the implementation details irrelevant to the narrative and retaining the ideas and techniques worth attention. Among all the details, code is the goriest. Therefore, no good thesis contains code, and good authors never have a concept with an underscore in its name.

The solution is to complete the process of abstraction and the process of naming abstract concepts. If the code, data, and theorem proofs need to be retained, they should be included in the appendix, and should not be considered part of the thesis.

Law 4: The proportion of western characters in the text is negatively correlated with the quality of a Chinese thesis.

The thesis is specifically required to be written in Chinese, but poor writers love to mix languages. Law 3 criticizes mixing in code snippets (which are not in any human language), and Law 4 criticizes mixing in all foreign expressions (which are not in Chinese). I can understand why you keep “AlphaGo” (阿尔法围棋) untranslated, but I really cannot when you keep “model-based RL” (基于模型的强化学习). This year, two students wrote like this: “降低cache的miss率” (reduce the miss rate of cache); the theses are a total mix-up. Come on, you are about to graduate, yet you cannot even speak your native language fluently! It is normal not to know how to express a concept in Chinese. If the concept is important, you should put in the time to find a common translation; if it is a new concept, you should put some thought into coming up with a professional, accurate, and recognizable name; if you are too lazy to find a name, it only shows that you don’t give a fuck about the bullshit you wrote.

I have a suggestion for those who find themselves with the bad habit of mixing languages: for every western character used in the writing, donate a dollar to charity. Writing in western characters is not prohibited, but it should hurt a little, to keep you from abusing it.

Law 5: The average length of mathematical symbols is negatively correlated with the quality of the thesis.

Einstein’s writing of the equation:

E = mc²

The same equation, if relativity had been discovered by one of our students:

Energy = Mass × LightSpeed × LightSpeed

Oh please, at the very least, always write text in text mode! Wrap words in \text or \mathrm macros, otherwise every multi-letter name is typeset as a product of italic variables 😅

Alright, let’s just appreciate how different the masters are. Why do their works become timeless classics, while we are still struggling to pass the thesis defense? When doing the work, we did not take the time to simplify: the text is full of redundant words, the figures are convoluted, and even the formulas are lengthy. The higher the complexity, the worse the academic aesthetics.

Law 6: The maximum number of charts appearing on a single page is negatively correlated with the quality of the thesis.

When you cannot fill enough pages, it is always faster to spam charts. As soon as students reach the experiment section, they pick up the profiler and take screenshots. Suddenly, a dozen images in a row, with dazzling blocks all over the screen, make the thesis look full and complete. As a result, the reviewers are fucked: what the hell is all that? Piles of colorful blocks, lacking accompanying text descriptions… Reading this is more challenging than an IQ test questionnaire! Only the author and God know the meaning, or, more often, only God knows. If the thesis is of decent quality, the reviewers just skip the cryptic part to avoid making the meeting awkward. If not, the piled-up charts may be the last straw on their tolerance.

If this metric turns out to be relatively high, it at least indicates that some charts and text in the thesis are improperly arranged, and it is likely that some charts are not adequately explained or referenced in the text. Keep in mind that the main body of a thesis is its text, while charts and tables are merely auxiliary tools for intuitive understanding. When using charts and tables, note first that they assist the text, not the other way around. Avoid long consecutive runs of charts and tables, as they go against the principle of clear communication.

Secondly, it is important to evaluate the intuitiveness of charts and tables. The meaning of the charts and tables should be clear and easy to understand at a glance. When charts and tables are complex, it’s important to highlight the key points and guide the reader’s attention. Alternatively, provide detailed captions or reading guidelines.

Law 7: The number of screenshots is negatively correlated with the quality of the thesis.

In line with the previous law: it is recommended to pull the “PrintScr” key out of your keyboard while writing – it is the Demon’s key. Tempted by this key, students become lazy and constantly take screenshots of software interfaces such as VCS, VS Code, or the command prompt and insert them into the thesis as figures. This often leads to oversized or undersized fonts, mismatched fonts, inconsistent image styles, missing descriptions, a lack of focus, and the simple piling up of raw material. Even when none of these issues is particularly serious, they will leave a poor impression on readers: whether or not the meaning of the charts is clear, they first leave an impression of sloppiness.

Screenshots of an IDE will certainly contain code! See Laws 3 and 4. Among all captures, screenshots of debugging logs from the terminal are the least tolerable; capturing the code would even be better. Code, however unfit for a thesis, at least has well-defined syntax and semantics, while debugging logs are a total waste of paper and ink.

Law 8: The page number of the last page of “Related Works”, as a proportion of the total page count, is negatively correlated with the quality of the thesis.

I always check this metric first, because it is the easiest to check: find the last page of the related works, stick a finger in, close the cover, and compare the thickness of the two parts. This takes five seconds, but it reliably exposes those who spam pages in the related work section. A thesis that writes too much related work is bound to be of poor quality, because writing about related work is much harder than writing about your own! Just imagine: during the defense, the reviewers have to flip through your thesis and listen to your presentation, dual-threaded in their minds, while you spend twenty minutes introducing others’ work – well-known, clichéd, or unrelated – before getting to your own. The result is bound to be terrible.

Surveying work requires a lot of devotion and academic expertise. First, it is necessary to gather and select the important related work. Here we should not discriminate by institution or publication, as specific work must be judged by objective scientific assessment; however, exploratory work tends (in a statistical sense) to be published by leading institutions in important conferences and journals. Our students keep citing papers from an “Indian Institute of Engineering and Technology”, published in an open-access journal from Hindawi, while overlooking the foundational works of the field. This only shows that they are not yet at the cutting edge of the field. Listing a set of important related works while avoiding low-quality ones is not easy; it is a task that requires long-accumulated academic expertise. Students who are just starting out are bound to fall short when they try to stuff too many pages into the related work section.

Summarizing the selected related works is itself a highly scientific task. Done well, its value is no less than that of any research point in the thesis. Students do not have the time and energy to go through all the related works, or they have not been trained to summarize properly, so most of the time they simply enumerate the works. Among those who manage reasonably well, some can write like this:

2 Related Works

2.1 Researches on A

A is important. There are many works on A.

Doe et al. proposed foo, and it is zesty.

Roe et al. proposed bar, and it is nasty.

Bloggs et al. proposed buz, and it is schwifty.

However, prior work sucks. Our work rocks.

2.2 Researches on B

B is also important…

Such a writing method at least ensures that the related works are listed with a beginning and an end, and that they center on the main work. It is commendable that some students can write at this level; many cannot even manage that, and only list several works that are not even related to the main topic. If the review experts had a mind to torment the author and make him or her look silly, they could ask: “What is the key technique behind this work?” “What is the difference between this work and that one?” “How is this work related to the main topic?” Asked such questions one by one, the author would be thoroughly embarrassed: of the many related works listed, some the author may never have opened, since it is not uncommon to copy BibTeX entries straight from Google Scholar into the thesis without really understanding what they mean.

The three questions I listed correspond to three essential elements of related work: key scientific concepts or discoveries, taxonomy, and relevance to the main topic. A good summary should achieve three goals: explain the key concept or discovery of each related work in a single sentence; organize the related works into several categories with clear boundaries; and make the logical relationship between the categories clear, with an evident narrative thread closely tied to the main topic. This requires strong scientific abstraction and concise language, which take thorough drafting and refinement to produce coherent and clear writing.

For most students, summarizing three research points in the first chapter of a thesis is already difficult. So I especially advise students not to take on a related-works chapter lightly in the second chapter, unless they have special insight, as it challenges their writing ability. It is better to mention related works where they are needed, with targeted explanations, while discussing each research point. Otherwise, students without training in scientific writing can easily turn the related work chapter into text that does not make much sense, written just to pad the word count. And since it is padding to meet the word count, students feel compelled to include more and more related work, even work that does not fit the main topic. Lengthy content usually implies a lower-quality thesis: there must be something wrong with it.

Law 9: The length of the longest consecutive block of experimental content is negatively correlated with the quality of the thesis.

Another place where spamming is heavily used is the experimental section. After all, compared with writing the thesis, running experiments feels more comfortable: add unnecessary metrics, draw bar charts comparing several related works, all with reused matplotlib scripts; alternatively, list the data in a table until it no longer fits on the page. In addition, many students who want three research points but cannot find enough material extract the experiments and write them up as the fifth chapter – the third research point of the thesis. As a separate chapter, the continuous experimental section runs at least 20 pages, lest it appear thin. The harm is as serious as spamming related works: experiments were originally meant to verify the research content, but now some research content has lost its experimental support, and a pile of disorganized experimental data creates irrelevant and meaningless content once again. This law proposes an index specifically designed to check for this common bad practice.

The underlying mindset can be attributed to two main factors.

The first factor is the prevalence of empiricism, which equates experiments with scientific research and thereby ignores the value of establishing and communicating scientific abstractions. As an important piece of infrastructure for cutting-edge physics research, the Beijing Electron-Positron Collider has generated over 10 PB of experimental data, and many important scientific discoveries originate from these raw data. Printing and binding these data into books would surely produce an invaluable work of science, and every physicist should own a set, right? Nope. The bandwidth of human communication through symbolic language is only about 39 bps, so the maximum amount of symbolic data a person can take in within a 100-year lifespan is only about 15 GB. No one could read through these data in a lifetime, let alone discover anything in them. What has true scientific value is not the data presented in graphs or tables, but the abstract scientific concepts the data verify. The role of experiments is only to verify scientific ideas. Experiments themselves are not even first-class citizens, and in many cases are second-class. In scientific research, it is often the abstract scientific idea that is proposed first, with experiments then designed specifically to verify it.

The second factor is that theses are mistaken for proofs of work (PoW): the experiments during my postgraduate studies directly prove the work I did, so I list them as a work report to my boss. This misunderstands one’s own identity in scientific research. Graduate students are not hired laborers. Scientific exploration is not labor for hire, and the effort put into scientific work usually does not yield a proportional reward (otherwise it would be considered “paper spamming”); the workload itself has limited significance. A thesis should be a carrier of creative ideas, not a record of work hours for an employer – only your employer cares about the latter. Your advisor, reviewers, examiners, and the potential readers of your thesis: are they your employers? What do they care about, and what do they want to see? Before writing a thesis, it is essential to understand this.

Of course, students will ask: “You know nothing about my situation! My work is simply too thin to yield three research points, so what else can I do?” Here I must remind those who have not yet graduated to avoid stumbling into such a situation. Writing a thesis requires planning in advance. Before starting the research work, lay out the rough content of three research points within the same field, and carry out one or two research projects under your advisor’s guidance specifically for this layout. Once you complete one valuable piece of research, it is easy to extend it into three research points: use the research work itself as the second point; find the defects of that work and improve on them, which is the third point; and summarize the ideas of the two works into a scientific, concise metaphysic (a model, paradigm, principle, or style), which is the first point. The metaphysic part can be completed entirely while writing the thesis.

In the worst case, you must also adhere to the principle of truthfulness. Do list the works you have done, and use the experiments to verify each. Do not spam experiments to stuff the chapters.

Law 10: The minimum distance between titles is positively correlated with the quality of the thesis.

Unlike all the previous laws, the tenth law is a recommendation, so it proposes the only positively correlated index. It is not easy to write a thesis with a clear structure and explicit threads. However, there is a quick and effective method: force yourself to clearly write out the thread of what follows, right under every title at every level. Take related work as an example:

2 Related works

This chapter introduces the important related work in a certain field, so that readers can understand the content later. From one aspect, the research work in a certain field can be roughly divided into categories of A, B, and C; from another aspect, these works can be divided into sub-aspects, foo, bar, buz. This chapter will introduce these aspects in a certain order. The author suggests that readers who are interested in a certain field can read [18] for a deeper understanding.

2.1 A

One of the more intuitive ways to solve problems in a certain field is to adopt method A. Method A has characteristics of blah, blah, and blah, and has been widely used in various fields such as blah [19], blah [20], blah [21], etc.

John [22] proposed Afoo, and… but it still lacks…

Addressing this concern, Jane [23] proposed Abar…

……

2.2 B

A is effective on… However, … As a result, B has been extensively researched since 2006. Unlike A, B specializes in D, E, and F…

……

With this method, readers obtain from the text under each title a clear guide to what follows, which significantly improves the reading experience. More importantly, consciously following this fixed pattern forces you to organize the narrative of the thesis: as long as it can be written out, the thesis is guaranteed to be structured, organized, and reasonable. A characteristic of this writing method is that there is guiding text below every title, so no two titles are ever adjacent. That is why I use the minimum distance between titles to describe it.

Conclusion

A thesis is a unique document in a student’s life: it represents their academic level at the time they obtain the degree, and it remains available for reference in the database forever. Although my own doctoral thesis received high praise all around, it still contains errors that should be corrected; having published it and proofread it again, I feel deep regret. I suggest that students treat thesis writing as a major event in their lives and put as much passion and effort into it as they can.

In this post, I discussed the process of writing a thesis from a critical perspective, which is, admittedly, not very constructive. I hope that after reading it, students can reflect, understand the basic requirements of a thesis, and establish a basic aesthetic judgment about theses. Recognizing one’s own shortcomings is the basis of progress; understanding that thesis writing is bound by many constraints and demands a great deal of hard work is how one establishes the right mindset.

Last month, I was invited by Li to give graduates a talk on the experience of writing and defending a thesis, which included more constructive feedback. Because time was short, much of the content was not fully developed, or even complete. Writing a thesis is an important task that requires long-term planning: the most important aspect is to organize and plan all the research work well, and this demands careful planning from the very beginning of the research. Therefore, I think such talks should not be given to graduates in May, but rather to incoming students in September. I plan to complete the content before September and share it then.

Cambricon-P: A Bitflow Architecture for Arbitrary Precision Computing

Scientific computing tasks need to process both high-precision numerical data (hundreds to millions of bits) and low-precision data (from the deep neural networks introduced by AI-for-Science algorithms). We judge that supporting this emerging paradigm of future scientific computing requires a new architecture different from deep learning processors.

We analyzed the application characteristics of high-precision numerical computing and found an "anti-memory-wall" phenomenon: data-transfer bottlenecks always appear in the near-compute levels of the storage hierarchy. This is because the data locality of numerical multiplication is extremely strong: if the word length natively supported by the computing device is not long enough, the operation must be decomposed into limbs, generating a huge number of memory accesses for intermediate results. Therefore, the new architecture must provide: 1. the ability to process numerical multiplication; 2. the ability to process larger limbs at a time; 3. the ability to efficiently process large amounts of low-precision data.
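To see where the intermediate-result traffic comes from, consider schoolbook multiplication over fixed-width limbs: N-limb operands generate N² partial products, each of which must be accumulated and carried. The following sketch (illustrative only, not Cambricon-P's implementation; the limb width is an assumed parameter) counts these intermediate results, showing why natively supporting wider limbs directly shrinks near-compute memory traffic:

```python
# Illustrative sketch (not Cambricon-P's implementation): schoolbook
# multiplication of two big numbers split into fixed-width limbs.
# Doubling the limb width the hardware supports natively halves the
# limb count and quarters the partial-product traffic.

LIMB_BITS = 16          # assumed native word length of the device
BASE = 1 << LIMB_BITS

def to_limbs(x):
    """Decompose x into little-endian limbs of LIMB_BITS each."""
    limbs = []
    while x:
        limbs.append(x % BASE)
        x //= BASE
    return limbs or [0]

def limb_mul(a, b):
    """Schoolbook limb multiplication; counts intermediate results."""
    al, bl = to_limbs(a), to_limbs(b)
    acc = [0] * (len(al) + len(bl))
    partial_products = 0
    for i, x in enumerate(al):
        for j, y in enumerate(bl):
            acc[i + j] += x * y          # one intermediate result
            partial_products += 1
    for k in range(len(acc) - 1):        # propagate carries
        acc[k + 1] += acc[k] // BASE
        acc[k] %= BASE
    result = sum(d * BASE**k for k, d in enumerate(acc))
    return result, partial_products

a, b = 12345678901234567890, 98765432109876543210
prod, count = limb_mul(a, b)
assert prod == a * b
print(count)   # 20 partial products (4 limbs x 5 limbs)
```

With 16-bit limbs the two operands above decompose into 4 and 5 limbs, so 20 partial products are written and re-read; with 64-bit limbs the same product needs only 2.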

Addressing these requirements, we designed the Cambricon-P architecture.

The Cambricon-P architecture has the following features:

  • Bitflow is used to support both high-precision numerical multiplication and low-precision linear algebra operations;
  • A carry-select mechanism partially alleviates the strong data dependences in the multiplication algorithm, achieving sufficient hardware parallelism;
  • A bit-indexed inner-product algorithm decomposes the computation into single-bit vectors, exploits the redundant computation hidden in finite-field linear algebra, and reduces the required computing logic to 36.7% of the naïve method.

To scale the architecture up and support larger limbs, the overall architecture of Cambricon-P is designed in a fractal manner, reusing the same control structure and similar dataflows at the Core level, the Processing Element (PE) level, and the Inner Product Unit (IPU) level. We developed a draft computing library, MPApca, for the Cambricon-P prototype, implementing the Toom-Cook 2/3/4/6 fast multiplication algorithms and the Schönhage–Strassen fast multiplication algorithm (SSA), to evaluate against existing HPC systems such as CPU+GMP and GPU+CGBN.
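For reference, Toom-Cook-2 (better known as Karatsuba) is the simplest member of the Toom-Cook family: it splits each operand in half and trades one half-size multiplication for a few additions, yielding the O(n^1.585) recurrence. A minimal sketch of the classic algorithm (not MPApca's actual code; the cutoff parameter is an assumption):

```python
# Minimal Toom-Cook-2 (Karatsuba) sketch; not MPApca's implementation.
# Splitting each operand in half replaces 4 half-size multiplications
# with 3: z2 = x1*y1, z0 = x0*y0, and the cross terms recovered from
# a single product (x1+x0)*(y1+y0) - z2 - z0.

def karatsuba(x, y, cutoff_bits=64):
    # small operands: fall back to the machine multiplier
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    x1, x0 = x >> half, x & ((1 << half) - 1)   # x = x1*2^half + x0
    y1, y0 = y >> half, y & ((1 << half) - 1)
    z2 = karatsuba(x1, y1)                      # high parts
    z0 = karatsuba(x0, y0)                      # low parts
    z1 = karatsuba(x1 + x0, y1 + y0) - z2 - z0  # cross terms, one mul
    return (z2 << (2 * half)) + (z1 << half) + z0

a = 3**200   # ~317-bit operand
b = 7**150   # ~422-bit operand
assert karatsuba(a, b) == a * b
```

Higher-order Toom-Cook variants (3/4/6) split the operands into more parts and evaluate at more points, saving more multiplications at the cost of heavier interpolation; SSA takes over at even larger sizes.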

Experimental results show that MPApca achieves up to 100.98x speedup over GMP on a monolithic multiplication, and on average 23.41x speedup and 30.16x better energy efficiency across four typical applications.

Published on MICRO 2022. [DOI] [PDF]

Since 1968, more than 1,900 papers have been accepted at MICRO. Cambricon-P was awarded Best Paper Runner-up, making it the 4th nominee from mainland China institutions.

CCF Chips 2022: The Post Golden Age of Computer Architectures, What's Next to the Deep Learning Processors?

Invited by Prof. LI, Chao of SJTU, I attended the 1st CCF Chips conference and gave an online talk in the session of Outstanding Young Scholars in Computer Architecture. In the talk I shared the results and some further thoughts from a paper then under (single-blind) review, Rescue to the Curse of Universality. Since the results were still undergoing peer review, the points are for reference only.

UPD: Paper published on SCIENCE CHINA: Informational Sciences. [DOI] [PDF]

Take home messages:

  • The Curse: it is impossible to build a computer that is both universal and efficient in logic circuits.
  • REnc Architecture: an architecture that takes locality as a scale-invariant constraint is sufficient to be efficient.
  • The results suggest that today's DSAs may be over-specialized: there is theoretical room for universality while keeping efficiency optimal.

Following the laws of the semiconductor industry, we predict that DLPs will become more universal, more applicable, more standardized, and more systematic.

(The talk is only given in Mandarin currently.)

MPMD decides Halting

I met JI, Yu, another CCF distinguished doctoral thesis award recipient, in Xining. He was very confused by my doctoral thesis: How could you make such a complex system fractal? Did you really do that at Cambricon? I explained to him that fractalness is primitive, rather than something I imposed for the sake of thesis writing. In the talk show that day, I redefined the concept of fractal parallel computing: fractalness is a primitive property of parallelizable problems, namely that there exists a finite program that controls arbitrarily scaled parallel machines and solves the problem at any size in finite time.

It is not only Doctor JI; many peers have raised similar concerns. Indeed, this shows that my works are still preliminary, and I must admit that fractal machines are as yet hardly useful in practice to other developers. But do take one piece of advice from me: if your system is not fractal, you should be really very cautious!

Allowing each node of a parallel machine to be programmed separately makes the situation awkward: the excessive programs serve as advice strings to the computation.

Take an MPMD model as an example. Since there are n nodes to program, each working in parallel, I can hardcode one bit of information into each node. At the very beginning of the computation, I can gather the bits, forming an n-bit natural number, and distribute it to all nodes. Let us denote that number by a. Trivially, a can take up to 2^n distinct values.

Why does that matter? a is smuggled into the programs; it depends on the scale of the parallel machine, which in turn scales with the problem size n, thus introducing non-uniformity. Say we let a be the number of Turing machines of description size n that halt on empty tapes. Then we immediately obtain the following algorithm:

  1. Input M as the description of a TM (n bits).
  2. Gather and distribute the value of a.
  3. c ← 0.
  4. Parallel-for each TM M_i of description size n:
    1. Simulate M_i on an empty tape until it halts.
    2. Increase c atomically.
    3. If c = a, abort the parallel for.
  5. Return whether the simulation of M finished or was aborted.

It decides the Halting problem in finite time.
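The counting argument can be demonstrated on a toy universe of machines whose halting behavior is fixed by hand (for a genuine TM enumeration, the advice a is of course uncomputable; here it is simply hardcoded, exactly as the n nodes would smuggle it in). All names below are illustrative:

```python
# Toy demonstration of the advice-counting construction. The "machines"
# are generator functions; whether each halts is fixed by hand, and the
# advice a (how many of them halt) is hardcoded, playing the role of
# the n-bit number gathered from the nodes.

def make_machine(halts_after):
    """A toy 'TM': runs forever if halts_after is None, else halts."""
    def run():
        step = 0
        while halts_after is None or step < halts_after:
            step += 1
            yield
    return run

# Our toy enumeration of size-n machines: three halt, one loops forever.
machines = [make_machine(3), make_machine(None),
            make_machine(7), make_machine(1)]
a = 3  # advice: number of machines above that halt (hardcoded)

def decides_halting(target_index):
    """Round-robin-simulate all machines; stop once a have halted."""
    runs = [m() for m in machines]
    halted = [False] * len(machines)
    count = 0
    while count < a:                      # abort condition: c = a
        for i, r in enumerate(runs):
            if halted[i]:
                continue
            try:
                next(r)                   # one simulation step
            except StopIteration:
                halted[i] = True
                count += 1
    return halted[target_index]           # did M's simulation finish?

print([decides_halting(i) for i in range(4)])  # → [True, False, True, True]
```

The non-halting machine (index 1) is correctly reported as non-halting after finite time: once a machines have halted, every simulation still running is known never to halt, so the parallel-for can be aborted safely.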

If your system does not allow this (deciding non-recursive problems), it is already fractal, though not necessarily in the exact form given in my thesis. The fundamental problem is that the excessive programs are harmful: theoretically, they make non-computable problems computable on your system; practically, they raise various problems, including the rewriting-every-year problem we encountered at Cambricon (Cambricon-F: Machine Learning Computers with Fractal von Neumann Architecture). Fractal computing systems stand against these bad practices.

I hope the above argument makes the point clear.

Reference