# benchmark

op

# naive C with openmp

for for for

# unroll, first try

h

# register allocation

kernels

# unroll, second try

simd

# neon intrinsics

optional

# naive neon assembly with pld

asm

# pipeline optimize, first try

more register load mla

# pipeline optimize, second try

interleave load mla

# pipeline optimize, third try

loop tail

# usual practice, load/save

233

# usual practice, unroll

233

# usual practice, save register

233
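
The first few steps of the outline above (naive C with OpenMP, unrolling, NEON intrinsics, loop tail) can be sketched in C. This is only an illustration under stated assumptions: the operation is a plain multiply-accumulate `out[i] += in[i] * k`, and the function names `mla_naive` and `mla_unrolled` are hypothetical, not taken from the original material. The NEON path is used only when the compiler defines `__ARM_NEON`; elsewhere a scalar 4x unroll stands in for it.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Naive version: one multiply-accumulate per iteration.
 * mla_naive is a hypothetical name used only for this sketch. */
void mla_naive(const float *in, float *out, float k, int n)
{
#ifdef _OPENMP
    /* Only active when the compiler is run with OpenMP enabled. */
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; i++)
        out[i] += in[i] * k;
}

/* Unrolled version: 4 elements per iteration, then a scalar loop tail.
 * mla_unrolled is likewise a hypothetical name. */
void mla_unrolled(const float *in, float *out, float k, int n)
{
    int i = 0;

#if defined(__ARM_NEON)
    /* Broadcast k into all four lanes once, outside the loop. */
    float32x4_t vk = vdupq_n_f32(k);
    for (; i + 4 <= n; i += 4) {
        float32x4_t vin  = vld1q_f32(in + i);   /* load 4 inputs       */
        float32x4_t vout = vld1q_f32(out + i);  /* load 4 accumulators */
        vout = vmlaq_f32(vout, vin, vk);        /* vout += vin * vk    */
        vst1q_f32(out + i, vout);               /* store 4 results     */
    }
#else
    /* Scalar stand-in for the SIMD body on targets without NEON. */
    for (; i + 4 <= n; i += 4) {
        out[i + 0] += in[i + 0] * k;
        out[i + 1] += in[i + 1] * k;
        out[i + 2] += in[i + 2] * k;
        out[i + 3] += in[i + 3] * k;
    }
#endif

    /* Loop tail: the remaining n % 4 elements. */
    for (; i < n; i++)
        out[i] += in[i] * k;
}
```

Both functions should produce the same result for any `n`, including lengths that are not a multiple of 4, which is what the loop tail is for. The later steps in the outline (hand-written assembly with `pld`, interleaving loads with `mla`) go beyond what intrinsics express directly and are not covered by this sketch.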