Last year, while working on Cambricon-Q (a quantized-training architecture), I noticed that many colleagues did not know how to do error estimation, nor did they realize that randomized experiments call for it. The computer architecture community does not seem to take this seriously: when running experiments, people often execute a program that takes a few tens of milliseconds on a real system, grab the elapsed time with the time command, and then report the speedup of their simulation results against it. Such measurements carry error, and the results fluctuate within a range; when the speedup runs into the hundreds, a small error can lead to a completely different reported number.

In primary and secondary school science classes we are taught to "measure several times and take the average", but how many times is "several"? Is 5 enough? 10? 100? Or do we just keep repeating until the paper deadline? We need a scientific experimental method: error estimation for repeated randomized experiments. I have found that not every undergraduate probability course teaches this, yet it is an essential skill for doing research.

I am not a statistician. As an undergraduate I was the campus champion of failed courses, so I am certainly no expert in probability theory. But I once dug into Monte Carlo simulation as a hobby, so let me show how I go through this process myself. If anything below is wrong, I ask the probability experts to kindly correct me.

Gaussian estimation

Suppose we have a set of experimental measurements $X_1, X_2, \dots, X_n$. For now, assume they are independent and identically distributed, following a Gaussian distribution $X_i \sim \mathcal{N}(\mu, \sigma^2)$.

The distribution parameter $\mu$ is what we are after, but it does not show up directly in the measurements. To approximate $\mu$ we can take the arithmetic mean of the measurements: first sum them to get $S = \sum_{i=1}^{n} X_i \sim \mathcal{N}(n\mu, n\sigma^2)$, then divide by $n$. The arithmetic mean is then $\bar{X} = S/n \sim \mathcal{N}(\mu, \sigma^2/n)$. Note that when a Gaussian random variable is divided by $n$, its variance shrinks by a factor of $n^2$.
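To make this concrete, here is a minimal Python sketch (my own illustration, with made-up timing values) that computes the sample mean and its standard error from a list of repeated measurements:

    # a minimal sketch: approximate mu by the sample mean and compute the
    # standard error of that mean; `timings` is a hypothetical set of
    # repeated measurements.
    import math

    timings = [12.3, 11.8, 12.9, 12.1, 12.6]  # made-up example values
    n = len(timings)

    mean = sum(timings) / n                                # approximates mu
    var = sum((x - mean) ** 2 for x in timings) / (n - 1)  # sample variance
    sem = math.sqrt(var / n)   # spread of the mean shrinks by sqrt(n)
    print(mean, sem)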

Although the derivation above assumed a Gaussian distribution from the start, the Lévy-Lindeberg central limit theorem (CLT) says that the sum of arbitrarily many random variables tends toward a Gaussian distribution, as long as those random variables satisfy two conditions:

  • their variances are finite;
  • they are independent.

So no matter what distribution the measurements actually follow, we can derive the final distribution of $\bar{X}$ directly from the CLT, and the result is the same: as long as the number of experiments is large enough, $\bar{X}$ tends to $\mathcal{N}(\mu, \sigma^2/n)$.

The Gaussian distribution tells us that with 99.7% certainty, the gap between $\bar{X}$ and the true value $\mu$ we want is at most three times the standard deviation, i.e. $|\bar{X} - \mu| \le 3\sigma/\sqrt{n}$. We say that the 99.7%-confidence interval of $\mu$ is $[\bar{X} - 3\sigma/\sqrt{n},\ \bar{X} + 3\sigma/\sqrt{n}]$. As the number of experiments grows, the standard deviation of $\bar{X}$ keeps shrinking. We used to know only that "measure several times and average" works; now we also know the rate of convergence: every 4× more samples cuts the error by 2×.
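The $1/\sqrt{n}$ convergence is easy to check empirically. The following numpy sketch (my own sanity check, not part of the original derivation) draws means of $n$ samples from a deliberately non-Gaussian distribution and shows their spread roughly halving each time $n$ quadruples:

    # sketch: Monte Carlo check of the convergence rate.
    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 40, 160):
        # 20000 repeated "experiments", each averaging n exponential samples
        means = rng.exponential(scale=1.0, size=(20000, n)).mean(axis=1)
        print(n, means.std())  # shrinks roughly like 1/sqrt(n)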

The constants 3 and 99.7% we just used are one point on the cumulative distribution function (CDF) of the Gaussian distribution, read off a table; another common pair is 1.96 and 95%. To go from a given confidence interval to a confidence level we use the CDF, $\Phi(k) = P\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le k\right)$. The CDF leaves out only the right tail, but the interval we estimate is two-sided, so the left tail has to be subtracted as well, i.e. $\mathrm{conf} = \Phi(k) - \Phi(-k) = 2\Phi(k) - 1$. To go the other way, from a target confidence level to a confidence interval, we need the inverse of the CDF, the ICDF, also called the PPF.
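Assuming scipy is available, these relations can be checked in a few lines; the pairs 3 ↔ 99.7% and 1.96 ↔ 95% fall out directly:

    # sketch: two-sided confidence level <-> critical score, via the
    # Gaussian CDF and its inverse (PPF).
    from scipy.stats import norm

    print(2 * norm.cdf(3) - 1)         # ~0.997, the "three sigma" rule
    print(2 * norm.cdf(1.96) - 1)      # ~0.95

    print(norm.ppf((1 + 0.997) / 2))   # ~3
    print(norm.ppf((1 + 0.95) / 2))    # ~1.96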

In code, the Gaussian CDF can be implemented fairly directly with the math-library function erf. Its inverse erfinv is less common and does not exist in the C or C++ standard libraries; however, since a CDF is continuous, monotonic, and differentiable, with the probability density function (PDF) as its derivative, numerical methods such as Newton's method, the secant method, or bisection can invert it quite efficiently.
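As a sketch of that idea in Python (using the standard-library math.erf; the C++ code further below does the same thing with Newton-Raphson), inverting erf takes only a handful of iterations:

    # sketch: invert erf with Newton's method. the derivative of erf(y) is
    # 2/sqrt(pi) * exp(-y*y), so each step is cheap to evaluate.
    import math

    def erfinv(x, tol=1e-12):
        y = 0.0
        while True:
            step = (math.erf(y) - x) * math.sqrt(math.pi) / 2 * math.exp(y * y)
            y -= step
            if abs(step) < tol:
                return y

    # critical score of a two-sided 99.7% interval: sqrt(2)*erfinv(0.997) ~ 3
    print(math.sqrt(2) * erfinv(0.997))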

However, the estimate above relies entirely on the CLT, and the CLT only describes a limiting case. In real experiments we take 5 or 10 measurements, nowhere near infinitely many; do we force the CLT onto them anyway? That is clearly not sound.

Student's t-distribution estimation

There is a distribution dedicated to describing the aggregate of finitely many repeated random experiments: Student's t-distribution. On top of the Gaussian derivation above, the t-distribution additionally accounts for the difference between the sample standard deviation $s$ and the true $\sigma$.

The t-distribution is tied directly to the degrees of freedom, which in repeated random experiments can be understood as $n - 1$. At low degrees of freedom the t-distribution has longer tails, which leads to a more conservative error estimate; as the degrees of freedom go to infinity, the t-distribution approaches the Gaussian. For example, with 5 repeated experiments the degrees of freedom are 4: where the Gaussian estimate puts the 99.7%-confidence error at 3 times the standard deviation, Student's t-distribution puts it at 6.43 times, a substantial widening.
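With scipy the comparison is a one-liner per case; the numbers below should reproduce the 3 versus 6.43 figures quoted above:

    # sketch: Gaussian vs. Student's t critical scores at 99.7% confidence.
    from scipy.stats import norm, t

    cl = 0.997
    print(norm.ppf((1 + cl) / 2))         # ~3 standard deviations
    print(t.ppf((1 + cl) / 2, df=4))      # ~6.43 with 5 experiments (df = 4)
    print(t.ppf((1 + cl) / 2, df=1000))   # ~3 again: t approaches the Gaussian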

When the number of trials is small, we simply use the t-distribution in place of the Gaussian to describe $\bar{X}$. This matters especially when error estimation is used to decide whether to stop repeated experiments early: the Gaussian estimate is more prone to stopping too soon (particularly when the first two or three samples happen to be close to one another).

Implementation

In Python, just call the library: use scipy.stats.norm.cdf and scipy.stats.norm.ppf for the Gaussian estimate, and scipy.stats.t.cdf and scipy.stats.t.ppf for the Student's t estimate.
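Put together, a t-based estimate in Python might look like this sketch (the sample values are made up; scipy.stats.t.interval is an equivalent shortcut):

    # sketch: mean and confidence interval from a handful of measurements.
    import numpy as np
    from scipy import stats

    samples = np.array([12.3, 11.8, 12.9, 12.1, 12.6])  # made-up values
    mean = samples.mean()
    sem = samples.std(ddof=1) / np.sqrt(len(samples))    # standard error of the mean

    score = stats.t.ppf((1 + 0.95) / 2, df=len(samples) - 1)
    print(mean, "+-", score * sem)
    # equivalently: stats.t.interval(0.95, len(samples) - 1, loc=mean, scale=sem)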

In C++ this is harder: the standard library is not complete enough, and pulling in external libraries is inconvenient. The code below gives C++ implementations of both error estimates. The part between the two comment rules can be pulled out as a header for direct use; the part below them is a sample program that repeatedly experiments on a random variable drawn from $\mathcal{N}(5, 2)$ until the error estimator is 95% confident that the true value lies within ±10% of the estimate.

Code
errest.cpp
/*
error estimator for repeated randomized experiments.
every scientist should know how to do this...
requires c++17 or above.

Copyright (c) 2021,2022 zhaoyongwei<zhaoyongwei@ict.ac.cn>
IPRC, ICT CAS. Visit https://yongwei.site/error-estimator

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
*/

// ====================================================================
// contents before the DEMO rule is a header.
// ====================================================================
// #pragma once // <-- don't forget opening this in a header.
//                     compilers don't like this in the main file.

#define __STDCPP_WANT_MATH_SPEC_FUNCS__ 1 // for std::beta
#include <cmath>
#include <iterator>
#include <tuple>

namespace errest {

  // gaussian distribution. works as an approximation in most cases
  // due to CLT, but is over-optimistic, especially when samples are
  // few.
  template<class RealType = double>
  struct gaussian {
    static RealType critical_score(RealType cl, size_t df) {

      // due to CLT, mean of samples approaches a normal distribution.
      // `erf` is used to estimate confidence level.
      // now we set a confidence level and try obtain corresponding
      // confidence interval, thus we need the inverse of `erf`,
      // i.e. `erfinv`.
      using std::log; using std::sqrt; using std::erf; using std::exp;

      // solve erfinv by newton-raphson since c++ stl lacks it.
      auto erfinv = [](RealType x) {
        constexpr RealType k = 0.88622692545275801; // 0.5 * sqrt(pi)
        RealType step, y = [](RealType x) { // a rough first approx.
          RealType sign = (x < 0) ? -1.0 : 1.0;
          x = (1 - x)*(1 + x);
          RealType lnx = log(x);
          RealType t1 = RealType(4.330746750799873)+lnx/2;
          RealType t2 = 1 / RealType(0.147) * lnx;
          return sign * sqrt(-t1 + sqrt(t1 * t1 - t2));
        }(x);
        do {
          step = k * (erf(y) - x) / exp(-y * y);
          y -= step;
        } while (step); // converge until zero step.
                        // this maybe too accurate for our usecase?
                        // add a threshold or cache if too slow.
        return y;
      };
      // we hope that the "truth" E[mean] lands in the interval
      // [mean-x*sigma, mean+x*sigma] with a probability of
      //     conf_level = erf( x/sqrt(2) )
      // conversely,
      //     x = sqrt(2) * erfinv(conf_level)
      return sqrt(RealType(2)) * erfinv(cl);
      // to get interval, mult this multiplier on the standard error of mean.
    }
  };

  // student's t-distribution. yields an accurate estimation even for few
  // samples, but is more computational expensive.
  template<class RealType = double>
  struct student {
    static RealType critical_score(RealType cl, size_t df) {

      // similar to gaussian, we need to evaluate ppf for critical score.
      // ppf is the inverse function of cdf, and pdf is the derivative
      // function of cdf. we find the inverse of cdf by newton-raphson.
      using std::beta; using std::sqrt; using std::pow; using std::lgamma;
      using std::exp; using std::log; using std::nextafter;

      auto pdf = [](RealType x, RealType df) {
        return pow(df / (df + x * x), (df + 1)/2) /
               (sqrt(df) * beta(df/2, RealType(0.5)));
      };

      auto cdf = [](RealType x, RealType df) {
        // ibeta, derived from https://codeplea.com/incomplete-beta-function-c
        // original code is under zlib License,
        // Copyright (c) 2016, 2017 Lewis Van Winkle
        // i believe this snippet should be relicensed due to thorough rework.
        auto ibeta = [](RealType a, RealType b, RealType x){
          const RealType lbeta_ab = lgamma(a)+lgamma(b)-lgamma(a+b);
          const RealType first = exp(log(x)*a+log(1-x)*b-lbeta_ab) / a;
          RealType f = 2, c = 2, d = 1, step; int m = 0;
          do {
            RealType o = -((a+m)*(a+b+m)*x)/((a+2*m)*(a+2*m+1));
            m++; RealType e = (m*(b-m)*x)/((a+2*m-1)*(a+2*m));
            RealType dt = 1 + o/d; d = 1 + e/dt;
            RealType ct = 1 + o/c; c = 1 + e/ct;
            f *= (step = (ct+e)/(dt+e));
          } while(step - 1.);
          return first * (f - 1);
        };
        RealType root = sqrt(x*x+df);
        return ibeta(df/2, df/2, (x + root)/(2*root));
      };

      cl = 1 - (1 - cl) / 2;
      // solve the root of cdf(x) - cl = 0.
      // first use newton-raphson, generally starting at 0 since cdf is
      // monotonically increasing. disallow overshoot. when overshoot
      // is detected, an inclusive interval is also determined, then
      // switch to bisection for the accurate root.
      RealType l = 0, r = 0, step;
      while((step = (cdf(r, df) - cl) / pdf(r, df)) < 0) {
        l = r; r -= step;
      }
      while (r > nextafter(l, r)) {
        RealType m = (l + r) / 2;
        (cdf(m, df) - cl < 0 ? l : r) = m;
      }
      return nextafter(l, r); // let cdf(return val) >= 0 always holds.
    }
  };

  template<template<class> class Distribution = gaussian, class RealType = double>
  struct estimator {

    // std::tie(mean, error) = estimator::test(first, last, cl)
    //
    // perform error estimation over the range [`first`, `last`).
    // the ground truth should be included in the interval
    // [`mean-error`, `mean+error`] with a probability of `cl`.
    template<class InputIt>
    static std::tuple<RealType,RealType> test
    (InputIt first, InputIt last, RealType cl = 0.95) {

      using std::distance; using std::next; using std::sqrt;

      // sampling has 1 less degrees of freedom.
      auto n = distance(first, last);
      auto df = n - 1;

      // pairwise sum to reduce rounding error.
      auto sum = [](InputIt first, InputIt last,
                    auto&& trans, auto&& sum)->RealType {
        if (distance(first, last) == 1) {
          return trans(*first);
        } else {
          auto mid = first + distance(first, last) / 2;
          return sum(first, mid, trans, sum) + sum(mid, last, trans, sum);
        }
      };

      auto trans_mean = [n](RealType x)->RealType {
        return x/n;
      };
      auto mean = sum(first, last, trans_mean, sum);
      auto trans_var = [mean, df](RealType x)->RealType {
        return (x - mean) * (x - mean) / df;
      };
      auto var = sum(first, last, trans_var, sum);

      // assuming samples are i.i.d., we estimate `mean` as
      //
      //     samples[i] ~ N(mean, var)
      //     sum  = Sum(samples)
      //          ~ N(n*mean, n*var)
      //     mean = sum/n
      //          ~ N(mean, var/n)
      // the standard error of mean, namely `sigma`,
      //     sigma = sqrt(var/n).
      //
      // due to Levy-Lindeberg CLT, above estimation to `mean`
      // approximately holds for any sequences of random variables when
      // `n`->inf, as long as:
      //
      //   * with finite variance
      //   * independently distributed

      auto sigma = sqrt(var/n);

      // finally, confidence interval is (+-) `critical_score*sigma`.
      // here critical_score is dependent on the assumed distribution.
      // we provide `gaussian` and `student` here.
      //
      // - `gaussian` works well for many (>30) samples due to CLT,
      //   but are over-optimistic when samples are few;
      // - `student` is more conservative, depends on number of
      //   samples, but is much more computational expensive.

      return { mean, Distribution<RealType>::critical_score(cl, df) * sigma };
    }
  };

  using gaussian_estimator = estimator<gaussian>;
  using student_estimator = estimator<student>;

}

// ====================================================================
// DEMO: experiment with RNG, try to obtain its expectation.
// ====================================================================
#include <random>
#include <vector>
#include <iostream>
#include <iomanip>

int main(int argc, char** argv) {

  // the "truth" hides behind a normal distribution N(5,2).
  // after this demo converged, we expect the mean very close to 5.
  // we use student's t-distribution here, which works better for
  // small sets.
  // if you change the distribution to gaussian, you will likely
  // see an increased chance to fail the challenge.
  std::random_device rd;
  std::mt19937 gen(rd());
  std::normal_distribution<double> dist{5,2}; // <-- the "truth"

  // demo configuration.
  constexpr double confidence_level = 0.95; // 95% confidence to pass the challenge.
  constexpr double error_tolerance = 0.10;  // stop below +-10% error, ~[4.5, 5.5].

  double mean, error;
  size_t iteration = 5; // experiments per test.
  std::vector<double> samples;

  while (1) {
    // draw 5 more samples then test.
    std::cout << "Single Experiment Results: ";
    for (size_t i = 0; i < iteration; i++) {
      double sample = dist(gen); // <-- perform a single experiment.
      samples.push_back(sample);
      if (i < 10)
        std::cout << sample << " ";
      else if (i < 11)
        std::cout << "...";
    }
    std::cout << std::endl;

    // test the confidence interval with our estimator.
    std::tie(mean, error) = errest::student_estimator::test(samples.begin(), samples.end(), confidence_level);
    // or with structured binding declaration [c++17]:
    // auto [mean, error] = errest::student_estimator::test(samples.begin(), samples.end(), confidence_level);

    std::cout << " * Error Estimator Report: "
                 "collected " << samples.size() << " results, "
                 "mean: " << mean << ", error: " << error << std::endl
              << std::endl;

    // stop if error is below 10%,
    if (error < mean * error_tolerance) break;

  }

  std::cout << "demo converged with " << (confidence_level * 100) << " percent confidence "
               "within " << (error_tolerance * 100) << " percent error range. "
               "is the \"truth\" 5 within the interval "
               "[" << mean-error << ", " << mean+error << "]?" << std::endl;
  return 0;

}