在看波推荐的《competitive programers handbook》，第一次知道

3 天前

在看波推荐的《competitive programers handbook》，第一次知道 https://leetcode.com/problems/maximum-subarray/ 的 dp 解法原来是有名字的叫做 Kadane’s algorithm，然后学习了二分查找原来还有一种步进写法

int k = -1;
for (int b = n; b >= 1; b /= 2) {
    while (k + b < n && array[k + b] <= x)
        k += b;
}
if (k >= 0 && array[k] == x) {
    // found at k
}

循环不变式是 array[k] <= x，美

但更有趣的是 https://leetcode.com/problems/find-common-elements-between-two-arrays/ 这题，书上讨论了两种做法，第一种是对 array1 建立 hashmap，然后再遍历 array2 检查 x in hashmap，时间复杂度 O(n)；第二种是两个 array 排序，然后用类似 merge sort 合并时的方法双指针同时遍历，时间复杂度 O(nlogn)。

但是话锋一转，书上说虽然 hashmap 解法理论时间复杂度最优，但在 len(array) 数量级在1e6、数据是 1..1e9 随机整数这种级别时，实际跑起来却是排序算法更快。

我不信。虽然我已经见识过不少“理论时间复杂度 vs CPU ILP 输得一败涂地”的例子，但这个 case 我还是不信，所以我用 go 糊了一个测试： https://go.dev/play/p/XmPrC2VWIyX

$ taskset -c 1 ./bench 
           n         hashset       sortmerge       count      winner
------------  --------------  --------------  ----------  ----------
     1000000      123.259 ms      127.794 ms        1070     hashset
     2000000      271.036 ms      267.713 ms        3956   sortmerge
     3000000      452.092 ms      407.585 ms        8983   sortmerge
     4000000      592.858 ms      557.676 ms       16003   sortmerge
     5000000      775.459 ms      706.228 ms       24922   sortmerge
     6000000      996.271 ms      864.900 ms       35715   sortmerge
     7000000     1419.428 ms     1035.839 ms       48523   sortmerge
     8000000     1345.405 ms     1182.133 ms       63042   sortmerge
     9000000     1426.785 ms     1351.538 ms       79642   sortmerge
    10000000     1655.852 ms     1518.615 ms       98806   sortmerge

O(nlog) 的算法比 O(n) 算法快 37%？计算机真有意思😋Let's perf!

先看 IPC，发现 hash 算法的 IPC 低达 0.5，sort 算法高达（ガンダム）1.3，micro bench 启动！

$ taskset -c 1 perf stat -ddd ./bench sortmerge
6,771,796,030      cpu_core/instructions/           #    1.30  insn per cycle 

$ taskset -c 1 perf stat -ddd ./bench hashset
3,753,073,963      cpu_core/instructions/           #    0.51  insn per cycle

然后做一轮 topdown，发现居然是 tma_mem_bandwidth 造成了巨量 stall

$ taskset -c 1 perf stat -M topdownl1
77.2 %  tma_backend_bound
11.1 %  tma_bad_speculation    
2.7 %  tma_frontend_bound     
9.1 %  tma_retiring 

$ taskset -c 1 perf stat -M tma_backend_bound_group
9.4 %  tma_core_bound         
68.1 %  tma_memory_bound

$ taskset -c 1 perf stat -M tma_memory_bound_group
63.0 %  tma_dram_bound
4.3 %  tma_l3_bound
0.9 %  tma_l2_bound
1.5 %  tma_l1_bound
1.6 %  tma_store_bound

$ taskset -c 1 perf stat -M tma_dram_bound_group
65.0 %  tma_mem_bandwidth
24.9 %  tma_mem_latency

看一下这个 perf metrics 用的什么 pmu: tools/perf/pmu-events/arch/x86/meteorlake/mtl-metrics.json

    {
        "MetricExpr": "min(cpu_core@CPU_CLK_UNHALTED.THREAD@, cpu_core@OFFCORE_REQUESTS_OUTSTANDING.DATA_RD\\,cmask\\=4@) / tma_info_thread_clks",
        "MetricName": "tma_mem_bandwidth",
    },

好好好，intel 祖传启发式 perf metrics，这个 pmu 是收集 outstanding offcore read >=4，手动采集一下事件看看是哪里的代码造成的：

$ taskset -c 1 perf record -g -e 'cpu_core/OFFCORE_REQUESTS_OUTSTANDING.DATA_RD,cmask=4/'
$ perf report  --stdio
                          |--50.38%--runtime.mapaccess2_fast64
                          |--46.22%--runtime.mapassign_fast64
                           --1.41%--runtime.memhash64

来看看 runtime.mapaccess2_fast64 里具体哪一行导致的

$ perf annotate --stdio --symbol=runtime.mapassign_fast64
         : 247    match := g.ctrls().matchH2(h2(hash))
    0.00 :   4067f3: mov    (%r12),%r15
   88.76 :   4067f7: movq   %r15,%xmm1

注意这个采样没有 PEBS （perf evlist -v 输出里没有 precise_ip），annotate 结果有 off-by-one 的问题，所以实际应该是 mov (%r12),%r15 这个指令及之前的几个内存读，代码是 g.ctrls().matchH2(h2(hash)) 不过我不懂 swiss map，大概是 map 过大之后查询全部 L3 miss 落到内存读撞到内存墙了。

理解了根因(?)之后，修改就有思路了，不必魔改 swiss map，let's do partition 嘛，把原始的 1e6 size 的 array 分成 N 个 buckets，然后把相同的 bucket 的重排到一起，然后对每个 bucket 就可以用独立的小 map 了： https://go.dev/play/p/EJdUiFsN229 。这样改下来 O(n) 的力量又降临了，在 n=1e6 的规模上比 O(nlogn) 快五倍。

计算机太有意思了，连力扣都变得好玩起来了🥰