feat: Switch backend to PaddleOCR-NCNN; switch project to CMake

1. The project backend has been migrated to the PaddleOCR-NCNN algorithm and has passed basic compatibility tests.
2. The project is now organized with CMake; to better accommodate third-party libraries, a QMake project will no longer be provided.
3. Reorganized the copyright/notice files and the code tree to minimize the risk of infringement.

Log: Switch backend to PaddleOCR-NCNN; switch project to CMake
Change-Id: I4d5d2c5d37505a4a24b389b1a4c5d12f17bfa38c
This commit is contained in:
wangzhengyang
2022-05-10 09:54:44 +08:00
parent ecdd171c6f
commit 718c41634f
10018 changed files with 3593797 additions and 186748 deletions

139
3rdparty/ncnn/docs/Home.md vendored Normal file
View File

@ -0,0 +1,139 @@
### input data and extract output
```cpp
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include "net.h"
int main()
{
cv::Mat img = cv::imread("image.ppm", CV_LOAD_IMAGE_GRAYSCALE);
int w = img.cols;
int h = img.rows;
// subtract 128, norm to -1 ~ 1
ncnn::Mat in = ncnn::Mat::from_pixels_resize(img.data, ncnn::Mat::PIXEL_GRAY, w, h, 60, 60);
float mean[1] = { 128.f };
float norm[1] = { 1/128.f };
in.substract_mean_normalize(mean, norm);
ncnn::Net net;
net.load_param("model.param");
net.load_model("model.bin");
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.set_num_threads(4);
ex.input("data", in);
ncnn::Mat feat;
ex.extract("output", feat);
return 0;
}
```
### print Mat content
```cpp
void pretty_print(const ncnn::Mat& m)
{
for (int q=0; q<m.c; q++)
{
const float* ptr = m.channel(q);
for (int z=0; z<m.d; z++)
{
for (int y=0; y<m.h; y++)
{
for (int x=0; x<m.w; x++)
{
printf("%f ", ptr[x]);
}
ptr += m.w;
printf("\n");
}
printf("\n");
}
printf("------------------------\n");
}
}
```
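A usage sketch (assuming the `feat` blob extracted in the first example above):
```cpp
// dump the extracted output blob channel by channel
pretty_print(feat);
```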
### visualize Mat content
```cpp
void visualize(const char* title, const ncnn::Mat& m)
{
std::vector<cv::Mat> normed_feats(m.c);
for (int i=0; i<m.c; i++)
{
cv::Mat tmp(m.h, m.w, CV_32FC1, (void*)(const float*)m.channel(i));
cv::normalize(tmp, normed_feats[i], 0, 255, cv::NORM_MINMAX, CV_8U);
cv::cvtColor(normed_feats[i], normed_feats[i], cv::COLOR_GRAY2BGR);
// check NaN
for (int y=0; y<m.h; y++)
{
const float* tp = tmp.ptr<float>(y);
uchar* sp = normed_feats[i].ptr<uchar>(y);
for (int x=0; x<m.w; x++)
{
float v = tp[x];
if (v != v)
{
sp[0] = 0;
sp[1] = 0;
sp[2] = 255;
}
sp += 3;
}
}
}
int tw = m.w < 10 ? 32 : m.w < 20 ? 16 : m.w < 40 ? 8 : m.w < 80 ? 4 : m.w < 160 ? 2 : 1;
int th = (m.c - 1) / tw + 1;
cv::Mat show_map(m.h * th, m.w * tw, CV_8UC3);
show_map = cv::Scalar(127);
// tile
for (int i=0; i<m.c; i++)
{
int ty = i / tw;
int tx = i % tw;
normed_feats[i].copyTo(show_map(cv::Rect(tx * m.w, ty * m.h, m.w, m.h)));
}
cv::resize(show_map, show_map, cv::Size(0,0), 2, 2, cv::INTER_NEAREST);
cv::imshow(title, show_map);
}
```
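A usage sketch (assuming the same `feat` blob; the window title is arbitrary):
```cpp
// tile all channels into one image; red pixels mark NaN values
visualize("output feature map", feat);
cv::waitKey(0);
```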
### FAQ
Q: Where did ncnn come from?
A: Deep learning algorithms had to be deployed on mobile phones; caffe has too many dependencies and phones have no CUDA, so a fast and small forward-inference implementation was needed.
Q: Where does the name ncnn come from?
A: "cnn" is short for convolutional neural network, and the leading "n" is a multi-way pun: new/next (a brand-new implementation), naive (ncnn is a naive implementation), neon (ncnn was originally optimized for phones), and the author's handle (←_←).
Q: Which platforms are supported?
A: Cross-platform: android / ios / linux / windows / macos, and it can also run on bare metal.
Q: How accurate is the computation?
A: armv7 neon float does not follow the IEEE 754 standard, and some functions use fast approximations (such as exp, sin, etc.): fast, while keeping the precision high enough.
Q: The logo?
A: The author is a Minecraft player, hence the soulful hand-drawn pixel cat; you can also find ncnn...

View File

@ -0,0 +1,48 @@
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.azarlive.android.png) Azar-视频交友与聊天 June 20, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.cyberlink.youcammakeup.png) 玩美彩妆 - 自拍美颜 & 智能美妆相机 June 21, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.makeup.png) You Makeup Photo Camera 2.1.5
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.cartoon.cam.png) 滤镜相机 Cartoon Camera- Paintlab January 24, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.pipcamera.activity.png) 画中画相机 January 30, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.hefe.pro.editor.png) Photo Editor Pro 1.1.4.1029
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.apus.camera.id.png) Air Camera 1.7.3.1002
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.fotoable.fotobeauty.png) 美丽拍-懂你的自拍美颜相机 February 1, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.perfectcorp.ycf.png) 玩美Fun-特效动图自拍滤镜&分享相片! May 15, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.ufotosoft.justshot.png) Sweet Snap - 生活贴纸&图像编辑器,实时滤镜,录制视频和有趣表情包,美容效果 June 22, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.wantu.activity.png) 玩图 - 美图相机 March 29, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.meitu.meiyancamera.png) 美颜相机 7.6.95
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.lyrebirdstudio.colorizer.lite.png) 自拍相机 - 照片编辑器和过滤器和贴纸 April 27, 2018
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.apusapps.fulakora.png) APUS Camera 1.7.2.1001
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/video.like.png) LIKE短视频 — 魔法视频自拍神器 2.2.4
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.qiyi.video.png) 爱奇艺 9.6.0
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.eg.android.AlipayGphone.png) 支付宝 10.1.25.752
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.perfectcorp.beautycircle.png) YouCam Shop - World's First AR Makeup Shopping App 3.4.0
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.lyrebirdstudio.beauty.png) 美容化妆自拍相机和自拍照片编辑器 1.4.8
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.jingdong.app.mall.png) 京东-挑好物,上京东 7.0.8
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.versa.png) Versa 2.9.2
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.tencent.weishi.png) 微视 4.3.1.88
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.smile.gifmaker.png) 快手短视频—国民短视频平台 5.4.2.5360
![](https://github.com/nihui/ncnn-assets/raw/master/20180626/com.sdu.didi.psnger.png) 滴滴出行 5.3.0

View File

@ -0,0 +1,118 @@
caffe-android-lib https://github.com/sh1r0/caffe-android-lib
mini-caffe https://github.com/luoyetx/mini-caffe
openblas-0.2.20 https://github.com/xianyi/OpenBLAS
ncnn https://github.com/Tencent/ncnn
***
squeezenet_v1.1 https://github.com/DeepScale/SqueezeNet/tree/master/SqueezeNet_v1.1
mobilenet_v1 https://github.com/shicai/MobileNet-Caffe
vgg16 https://gist.github.com/ksimonyan/211839e770f7b538e2d8
***
Host platform and compiler configuration:
fedora 27, android-ndk-r15c, target arch = arm64-v8a
we manually update openblas package to version 0.2.20 in caffe-android-lib for better performance
***
Device: Nexus 6p
OS: LineageOS 15.1(Android 8.1.0), ROM newly flashed without any third-party APP installed
CPU: Snapdragon 810 (Cortex-A57 2.0GHz x 4 + Cortex-A53 1.55GHz x 4)
RAM: 3G
***
Benchmark method:
Run squeezenet and mobilenet inference 23 times in a loop, discard the first three warm-up records, and then calculate the average inference time
Run vgg16 9 times in a loop, discard the first warm-up record, and then calculate the average inference time
Since the system may force the SoC to lower its frequency when the temperature gets high, sleep for over 1 minute before each benchmark to prevent this issue.
fps performance: fps = 1000 / avgtime(ms)
cpu usage: take the CPU value in top utility output
memory usage: take the RES value in top utility output
the overall power consumption and performance per watt:
Disable usb charging: adb shell echo 0 > /sys/class/power_supply/battery/charging_enabled
current(μA) = adb shell cat /sys/class/power_supply/battery/current_now (multiply -1 for 810 chip)
voltage(μV) = adb shell cat /sys/class/power_supply/battery/voltage_now
power consumption(mW) = current / 1000 * voltage / 1000 / 1000
performance per watt(1000fps/W) = fps / power consumption * 1000
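worked example (hypothetical numbers, not a measurement): at 20 fps, a current of 200000 μA and a voltage of 4000000 μV give power consumption = 200000 / 1000 * 4000000 / 1000 / 1000 = 800 mW, and performance per watt = 20 / 800 * 1000 = 25 (1000fps/W)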
***
The binary size after debug stripping
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/1.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/2.jpg)
***
squeezenet
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/3.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/4.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/5.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/6.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/7.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/8.jpg)
***
mobilenet
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/9.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/10.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/11.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/12.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/13.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/14.jpg)
***
vgg16
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/15.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/16.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/17.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/18.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/19.jpg)
![](https://github.com/nihui/ncnn-assets/raw/master/20180413/20.jpg)

View File

@ -0,0 +1,46 @@
|device|gpu|api version|driver version|squeezenet|mobilenetssd|yolov3|
|---|---|---|---|---|---|---|
|intel-i7-7700|Intel(R) HD Graphics 630 (Kaby Lake GT2)|1.1.90|18.3.4|y|y|y|
|GTX-1060|GeForce GTX 1060 3GB|1.1.95|418.172.0|y|y|y|
|AMD-Radeon R9 M290X|AMD RADV PITCAIRN (LLVM 7.0.1)|1.1.70|18.3.4|y|y|y|
|iphone-5s|Apple A7 GPU|1.0.82|0.2.1825|y|y|y|
|huawei-nexus6p|Adreno (TM) 430|1.0.49|35.601.2388|y|y|y|
|vivo-y1731ca|Adreno (TM) 505|1.0.61|37.845.1429|y|n|n|
|vivo-y85a|Adreno (TM) 506|1.0.61|2.944.3349|y|n|n|
|vivo-x9s|Adreno (TM) 510|1.0.61|42.917.1172|y|y|y|
|meizu-15|Adreno (TM) 512|1.0.38|29.189.223|n|n|n|
|chuizi-jianguo-pro2|Adreno (TM) 512|1.0.38|21.219.2615|n|n|n|
|xiaomi-note3|Adreno (TM) 512|1.0.38|39.369.2305|n|n|n|
|oppo-r11|Adreno (TM) 512|1.0.38|42.977.756|n|n|n|
|xiaomi-6x|Adreno (TM) 512|1.0.61|14.322.3739|y|y|y|
|oppo-r11s+|Adreno (TM) 512|1.0.61|35.1004.3936|y|y|y|
|vivo-x20a|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|
|vivo-v1816a|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|
|vivo-z1|Adreno (TM) 512|1.0.61|43.10.3141|y|y|y|
|xiaomi-redmi-note5|Adreno (TM) 512|1.0.61|63.219.2354|y|y|y|
|google-pixel|Adreno (TM) 530|1.1.87|512.354.0|y|y|y|
|nubia-z17|Adreno (TM) 540|1.0.38|1.28.32|n|n|n|
|samsung-galaxys8+|Adreno (TM) 540|1.0.61|29.896.3583|y|y|y|
|oneplus-5t|Adreno (TM) 540|1.0.61|18.1023.2233|y|y|y|
|google-pixel2|Adreno (TM) 540|1.1.66|512.313.0|y|y|y|
|essential-ph-1|Adreno (TM) 540|1.1.66|512.319.0|y|y|y|
|vivo-x23|Adreno (TM) 615|1.0.66|33.870.3328|y|y|y|
|vivo-v1813ba|Adreno (TM) 615|1.0.66|33.870.3328|y|y|y|
|xiaomi-8se|Adreno (TM) 616|1.0.66|30.913.18|y|y|y|
|vivo-nex-a|Adreno (TM) 616|1.0.66|33.870.3328|y|y|y|
|xiaomi-mix2s|Adreno (TM) 630|1.0.61|4.91.2976|y|y|y|
|heisha-SKR-A0|Adreno (TM) 630|1.0.61|36.173.3586|y|y|y|
|heisha-SKR-A0|Adreno (TM) 630|1.0.66|47.448.1532|y|y|y|
|oneplus-6|Adreno (TM) 630|1.1.66|512.324.0|y|y|y|
|vivo-iQOO|Adreno (TM) 640|1.1.87|512.361.0|y|y|y|
|meitu-m8s|Mali-T880|1.0.14|500.910.1017|n|n|n|
|huawei-p10|Mali-G71|1.0.53|151.949.2145|n|n|n|
|huawei-mate9|Mali-G71|1.0.53|151.949.2145|n|n|n|
|oppo-a73|Mali-G71|1.0.47|575.795.1934|n|n|n|
|vivo-y97|Mali-G72|1.0.58|240.537.3580|n|n|n|
|huawei-mate10|Mali-G72|1.0.66|14.0.0|y|y|y|
|huawei-v10|Mali-G72|1.0.66|14.0.0|y|y|y|
|huawei-vce-al00|Mali-G72|1.0.66|14.0.0|y|y|y|
|huawei-mate20|Mali-G76|1.0.66|14.0.0|y|y|y|
|huawei-pct-al10|Mali-G76|1.0.66|14.0.0|y|y|y|

View File

@ -0,0 +1,57 @@
```c
// use the full v register, %.4s
// 128-bit vreg matches %.4s
// a += b * c
float32x4_t _a = vld1q_f32(a);
float32x4_t _b = vld1q_f32(b);
float32x4_t _c = vld1q_f32(c);
asm volatile(
"fmla %0.4s, %2.4s, %3.4s"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// use the low 64 bits of the v register, %.2s
// low 64-bit vreg matches %.2s
// a += b * c
float32x2_t _a = vld1_f32(a);
float32x2_t _b = vld1_f32(b);
float32x2_t _c = vld1_f32(c);
asm volatile(
"fmla %0.2s, %2.2s, %3.2s"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// use a single lane of the v register, %.s[0] %.s[1] %.s[2] %.s[3]
// 32-bit register matches %.s[0]
// a += b * c[0]
// a += b * c[1]
// a += b * c[2]
// a += b * c[3]
float32x4_t _a = vld1q_f32(a);
float32x4_t _b = vld1q_f32(b);
float32x4_t _c = vld1q_f32(c);
asm volatile(
"fmla %0.4s, %2.4s, %3.s[0]"
"fmla %0.4s, %2.4s, %3.s[1]"
"fmla %0.4s, %2.4s, %3.s[2]"
"fmla %0.4s, %2.4s, %3.s[3]"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
qwq

View File

@ -0,0 +1,175 @@
# Adding a custom layer to ncnn
## Example
Here is an example of adding a custom layer, Relu6, i.e. std::min(6.f, std::max(0.f, val)):
```
Input input 0 1 input
Convolution conv2d 1 1 input conv2d 0=32 1=1 2=1 3=1 4=0 5=0 6=768
Relu6 relu6 1 1 conv2d relu6
Pooling maxpool 1 1 relu6 maxpool 0=0 1=3 2=2 3=-233 4=0
```
## Define the header file src/layer/relu6.h
```CPP
#ifndef LAYER_RELU6_H
#define LAYER_RELU6_H
#include "layer.h"
namespace ncnn {
class Relu6 : public Layer
{
public:
Relu6();
virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;
};
} // namespace ncnn
#endif // LAYER_RELU6_H
```
## Define the source file src/layer/relu6.cpp
```CPP
#include "relu6.h"
#include <math.h>
namespace ncnn {
Relu6::Relu6()
{
one_blob_only = true;
support_inplace = true;
}
int Relu6::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
int w = bottom_top_blob.w;
int h = bottom_top_blob.h;
int channels = bottom_top_blob.c;
int size = w * h;
#pragma omp parallel for num_threads(opt.num_threads)
for (int q=0; q < channels; q++)
{
float* ptr = bottom_top_blob.channel(q);
for (int i=0; i<size; i++)
{
ptr[i] = std::min(6.f, std::max(0.f, ptr[i]));
}
}
return 0;
}
} // namespace ncnn
```
## Modify src/CMakeLists.txt to register Relu6
```CPP
ncnn_add_layer(GroupNorm)
ncnn_add_layer(LayerNorm)
ncnn_add_layer(Relu6)
```
## Define the test case file src/test_relu6.cpp
```CPP
#include "layer/relu6.h"
#include "testutil.h"
static int test_relu6(const ncnn::Mat& a)
{
ncnn::ParamDict pd;
std::vector<ncnn::Mat> weights(0);
int ret = test_layer<ncnn::Relu6>("Relu6", pd, weights, a);
if (ret != 0)
{
fprintf(stderr, "test_relu6 failed a.dims=%d a=(%d %d %d)\n", a.dims, a.w, a.h, a.c);
}
return ret;
}
static int test_relu6_0()
{
return 0
|| test_relu6(RandomMat(5, 7, 24))
|| test_relu6(RandomMat(7, 9, 12))
|| test_relu6(RandomMat(3, 5, 13));
}
static int test_relu6_1()
{
return 0
|| test_relu6(RandomMat(15, 24))
|| test_relu6(RandomMat(17, 12))
|| test_relu6(RandomMat(19, 15));
}
static int test_relu6_2()
{
return 0
|| test_relu6(RandomMat(128))
|| test_relu6(RandomMat(124))
|| test_relu6(RandomMat(127));
}
int main()
{
SRAND(7767517);
return 0
|| test_relu6_0()
|| test_relu6_1()
|| test_relu6_2();
}
```
## Modify tests/CMakeLists.txt to register the Relu6 test case
```CPP
ncnn_add_layer_test(LSTM)
ncnn_add_layer_test(Yolov3DetectionOutput)
ncnn_add_layer_test(Relu6)
```
## Build
```
Build following the standard ncnn build steps
```
## Unit test
```
./test_relu6
```

View File

@ -0,0 +1,85 @@
## natural assembly
* no register dependency, no penalty
```
ld1 {v0.4s}, [r0], #16
fmla v10.4s, v16.4s, v24.s[0]
fmla v11.4s, v16.4s, v24.s[1]
fmla v12.4s, v16.4s, v24.s[2]
fmla v13.4s, v16.4s, v24.s[3]
```
## A53
* 128bit vector load cannot be dual issued with fmla, wait 2 cycles
* 64bit vector load cannot be dual issued with fmla, wait 1 cycle
* 64bit integer load can be dual issued with fmla, no penalty
* pointer update can be dual issued with fmla, no penalty
* 64bit vector load and 64bit vector insert can be dual issued, no penalty
* any vector load cannot be issued on the 4th cycle of each fmla (enters the accumulator pipeline)
### practical guide
* use 64bit vector load only
* issue vector load every three fmla
* 1 cycle to load 64bit, dual issue with the previous interleaved 64bit insert
* load the remaining 64bit into integer register, dual issue with fmla
* update pointer, dual issue with fmla
* insert 64bit into vector from integer register, dual issue with the next interleaved 64bit load
* add nop every three fmla if no load, seems to be faster
```
ldr d0, [r0] // 1 cycle, v0 first 64bit
fmla
ldr x23, [r0, #8] // 0 cycle, v0 second 64bit to temp register
fmla
add r0, r0, #16 // 0 cycle, update pointer
fmla
ldr d1, [r0] // 1 cycle, v1 first 64bit
ins v0.d[1], x23 // 0 cycle, v0 second 64bit complete
fmla
ldr x23, [r0, #8] // 0 cycle, v1 second 64bit to temp register
fmla
add r0, r0, #16 // 0 cycle, update pointer
fmla
ins v1.d[1], x23 // 1 cycle, v1 second 64bit complete
nop
fmla
fmla
fmla
nop
nop
fmla
fmla
fmla
```
## A55
* 128bit vector load cannot be dual issued with fmla, wait 2 cycles
* 64bit vector load can be dual issued with fmla, no penalty
* 64bit integer load can be dual issued with fmla, no penalty
* pointer update can be dual issued with fmla, no penalty
* 64bit vector insert can be dual issued with fmla, no penalty
### practical guide
* use 64bit vector load only
* load 64bit, dual issue with fmla
* load the remaining 64bit into integer register, dual issue with fmla
* update pointer, dual issue with fmla
* insert 64bit into vector from integer register, dual issue with fmla
* interleaved load loosens register dependency
* nop trick is not needed
```
ldr d0, [r0] // 0 cycle, v0 first 64bit
fmla
ldr x23, [r0, #8] // 0 cycle, v0 second 64bit to temp register
fmla
add r0, r0, #16 // 0 cycle, update pointer
fmla
ldr d1, [r0] // 0 cycle, v1 first 64bit
fmla
ins v0.d[1], x23 // 0 cycle, v0 second 64bit complete
fmla
ldr x23, [r0, #8] // 0 cycle, v1 second 64bit to temp register
fmla
add r0, r0, #16 // 0 cycle, update pointer
fmla
ins v1.d[1], x23 // 0 cycle, v1 second 64bit complete
fmla
```

View File

@ -0,0 +1,130 @@
```c
// use the full d register, %P
// d reg matches %P
// a += b * c
float32x2_t _a = vld1_f32(a);
float32x2_t _b = vld1_f32(b);
float32x2_t _c = vld1_f32(c);
asm volatile(
"vmla.f32 %P0, %P2, %P3"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// use the full q register, %q
// q reg matches %q
// a += b * c
float32x4_t _a = vld1q_f32(a);
float32x4_t _b = vld1q_f32(b);
float32x4_t _c = vld1q_f32(c);
asm volatile(
"vmla.f32 %q0, %q2, %q3"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// use a single lane of the d register, %P[0] %P[1]
// 32bit d reg matches %P[0]
// a += b * c[0]
// a += b * c[1]
float32x2_t _a = vld1_f32(a);
float32x2_t _b = vld1_f32(b);
float32x2_t _c = vld1_f32(c);
asm volatile(
"vmla.f32 %P0, %P2, %P3[0]"
"vmla.f32 %P0, %P2, %P3[1]"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// use a single lane of the q register, %e[0] %e[1] %f[0] %f[1]
// 32-bit q reg matches %e[0]
// a += b * c[0]
// a += b * c[1]
// a += b * c[2]
// a += b * c[3]
float32x4_t _a = vld1q_f32(a);
float32x4_t _b = vld1q_f32(b);
float32x4_t _c = vld1q_f32(c);
asm volatile(
"vmla.f32 %q0, %q2, %e3[0]"
"vmla.f32 %q0, %q2, %e3[1]"
"vmla.f32 %q0, %q2, %f3[0]"
"vmla.f32 %q0, %q2, %f3[1]"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// split the q register into two d registers, %e %f
// use %e %f to split q reg into two d regs
// a += b * c[0]c[1]
// a += b * c[2]c[3]
float32x2_t _a = vld1_f32(a);
float32x2_t _b = vld1_f32(b);
float32x4_t _c = vld1q_f32(c);
asm volatile(
"vmla.f32 %P0, %P2, %e3"
"vmla.f32 %P0, %P2, %f3"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// bind the declaration to a specific d register
// specify concrete d reg which want to save
// vmla.f32 d0, d2, d4
register float32x2_t _a asm("d0") = vld1_f32(a);
register float32x2_t _b asm("d2") = vld1_f32(b);
register float32x2_t _c asm("d4") = vld1_f32(c);
asm volatile(
"vmla.f32 %P0, %P2, %P3"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
```c
// bind the declaration to a specific q register
// bind q reg with data
// vmla.f32 q0, q1, q2
register float32x4_t _a asm("q0") = vld1q_f32(a);
register float32x4_t _b asm("q1") = vld1q_f32(b);
register float32x4_t _c asm("q2") = vld1q_f32(c);
asm volatile(
"vmla.f32 %q0, %q2, %q3"
: "=w"(_a) // %0
: "0"(_a),
"w"(_b), // %2
"w"(_c) // %3
:
);
```
If it were not for a compiler bug, register binding would be unnecessary; however...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41538
qwq

View File

@ -0,0 +1,52 @@
### broadcasting rule
ncnn BinaryOp accepts blobs with different shapes
C = BinaryOp(A, B)
shape notation convention is [w], [w,h], [w,h,c], [w,h,d,c]
|type|A|B|C|
|---|---|---|---|
|1|[1]|scalar|[1]|
|2|[1]|[2]|[2]|
|3|[1]|[2,3]|[2,3]|
|4|[1]|[2,3,4]|[2,3,4]|
|5|[2]|scalar|[2]|
|6|[2]|[1]|[2]|
|7|[2]|[2]|[2]|
|8|[3]|[2,3]|[2,3]|
|9|[4]|[2,3,4]|[2,3,4]|
|10|[2,3]|scalar|[2,3]|
|11|[2,3]|[1]|[2,3]|
|12|[2,3]|[3]|[2,3]|
|13|[2,3]|[2,3]|[2,3]|
|14|[3,4]|[2,3,4]|[2,3,4]|
|15|[2,3,4]|scalar|[2,3,4]|
|16|[2,3,4]|[1]|[2,3,4]|
|17|[2,3,4]|[4]|[2,3,4]|
|18|[2,3,4]|[3,4]|[2,3,4]|
|19|[2,3,4]|[2,3,4]|[2,3,4]|
|20|[1]|[2,3,4,5]|[2,3,4,5]|
|21|[5]|[2,3,4,5]|[2,3,4,5]|
|22|[4,5]|[2,3,4,5]|[2,3,4,5]|
|23|[3,4,5]|[2,3,4,5]|[2,3,4,5]|
|24|[2,3,4,5]|scalar|[2,3,4,5]|
|25|[2,3,4,5]|[1]|[2,3,4,5]|
|26|[2,3,4,5]|[5]|[2,3,4,5]|
|27|[2,3,4,5]|[4,5]|[2,3,4,5]|
|28|[2,3,4,5]|[3,4,5]|[2,3,4,5]|
|29|[2,3,4,5]|[2,3,4,5]|[2,3,4,5]|
Some special broadcasting rules exist for model compatibility; a minimal usage sketch follows the table below.
|special type|A|B|C|
|---|---|---|---|
|1|[2,3,4]|[1,1,4]|[2,3,4]|
|2|[2,3,4]|[2,3,1]|[2,3,4]|
|3|[1,1,4]|[2,3,4]|[2,3,4]|
|4|[2,3,1]|[2,3,4]|[2,3,4]|
|5|[2,3,4]|[1,3,4]|[2,3,4]|
|6|[2,3,4]|[2,1,4]|[2,3,4]|
|7|[1,3,4]|[2,3,4]|[2,3,4]|
|8|[2,1,4]|[2,3,4]|[2,3,4]|
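A minimal usage sketch (not part of the original tables; it mirrors the standalone-layer examples later in this commit) exercising type 11 above, A=[2,3] plus B=[1]:
```cpp
#include <vector>
#include "layer.h"

// C = A + B with broadcasting, e.g. A of shape [2,3] and B of shape [1]
static ncnn::Mat binaryop_add(const ncnn::Mat& a, const ncnn::Mat& b)
{
    ncnn::Option opt;
    opt.num_threads = 1;

    ncnn::Layer* op = ncnn::create_layer("BinaryOp");

    ncnn::ParamDict pd;
    pd.set(0, 0); // op_type 0 = ADD
    op->load_param(pd);
    op->create_pipeline(opt);

    std::vector<ncnn::Mat> bottoms(2);
    bottoms[0] = a;
    bottoms[1] = b;
    std::vector<ncnn::Mat> tops(1);
    op->forward(bottoms, tops, opt);

    op->destroy_pipeline(opt);
    delete op;
    return tops[0];
}
```
Called with `a` holding 2x3 floats and `b` holding a single float, the result keeps the [2,3] shape with the scalar added to every element.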

View File

@ -0,0 +1,63 @@
Mat structure is now allocator-aware via an extra allocator parameter with default zero value.
The good-old ncnn::fastMalloc()/ncnn::fastFree() will be used for a null allocator.
You could pass a custom allocator to delegate all memory allocation and deallocation.
```cpp
class Allocator
{
public:
virtual void* fastMalloc(size_t size) = 0;
virtual void fastFree(void* ptr) = 0;
};
```
ncnn already provides two simple pooled Allocator classes, one with a mutex lock and one without.
```cpp
ncnn::PoolAllocator locked_mempool;
ncnn::UnlockedPoolAllocator unlocked_mempool;
```
the two allocator types in ncnn
* blob allocator
used to allocate memory for all named blobs, which you could retrieve by Extractor::extract()
* workspace allocator
used to allocate memory for internal temporary use in layer implementation, such as the temp blob after padding in convolution
By default, all Extractor instances use the two allocators in the default option.
You can alter them by ncnn::set_default_option()
or you can set them per Extractor by Extractor::set_blob_allocator()/Extractor::set_workspace_allocator()
blob allocator is guaranteed to be called in-order in layer implementation during each Extractor lifecycle
while workspace allocator may be called synchronously
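A minimal sketch of the per-Extractor setup described above (the blob names "data"/"output" and the static pool objects are illustrative assumptions):
```cpp
#include "allocator.h"
#include "net.h"

static ncnn::UnlockedPoolAllocator g_blob_pool_allocator; // one per thread in concurrent use
static ncnn::PoolAllocator g_workspace_pool_allocator;    // locked, shared across threads

void run_once(ncnn::Net& net, const ncnn::Mat& in, ncnn::Mat& out)
{
    ncnn::Extractor ex = net.create_extractor();
    // delegate all blob/workspace allocations of this extractor to the pools
    ex.set_blob_allocator(&g_blob_pool_allocator);
    ex.set_workspace_allocator(&g_workspace_pool_allocator);
    ex.input("data", in);
    ex.extract("output", out);
}
```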
the practical usage
* one network, one-by-one inference
shared unlocked blob allocator for all Extractor
shared locked workspace allocator for all Extractor
* one network, concurrent inference
shared unlocked blob allocator for all Extractor in each thread
shared locked workspace allocator for all Extractor among all threads
* concurrent multiple networks, one-by-one inference for each network
shared unlocked blob allocator for all Extractor of each network
shared locked workspace allocator for all Extractor among all networks (for saving memory)
* concurrent multiple networks, concurrent inference for each network
shared unlocked blob allocator for all Extractor of each network in each thread
shared locked workspace allocator for all Extractor among all networks (for saving memory)

View File

@ -0,0 +1,119 @@
### what is packing and why
packing is the form of storing multiple short-sized values as one long-sized value.
element packing is well mapped to the underlying simd register, which usually uses one very wide register to store different types of values.
|C|elemsize|elempack|
|---|---|---|
|double|8|1|
|float|4|1|
|int|4|1|
|short|2|1|
|signed char|1|1|
|arm neon|elemsize|elempack|
|---|---|---|
|float64x2_t|16|2|
|float32x4_t|16|4|
|int32x4_t|16|4|
|float16x4_t|8|4|
|int8x8_t|8|8|
Though the real count of values doubles when elempack is two, the wide-sized value is still treated as one value from the Mat structure's point of view. For example, to store 40 float values in a Mat object: with elempack 1 the Mat width is 40, while with elempack 4 it is 10 (see the sketch after the table).
|dims|w|h|c|cstep|elemsize|elempack|
|---|---|---|---|---|---|---|
|1|40|1|1|40|4|1|
|1|10|1|1|10|16|4|
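A small sketch of the second row above (an assumption-level illustration; the `ncnn::Mat(w, elemsize, elempack)` constructor signature is assumed): 40 floats stored as 10 packed float32x4 elements.
```cpp
#include <stdio.h>
#include "mat.h"

int main()
{
    // 40 float values stored as 10 packed elements, each 16 bytes wide (float32x4)
    ncnn::Mat m(10, (size_t)16u, 4);
    fprintf(stderr, "w=%d elemsize=%zu elempack=%d cstep=%zu\n",
            m.w, m.elemsize, m.elempack, (size_t)m.cstep);
    return 0;
}
```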
### packing style convention
In practice, elempack 1, 4 and 8 are the most common cases; any other packing style is possible in theory.
The following table shows the packing axis used in ncnn for each number of dimensions.
|dims|packing axis|shape before packing|shape after packing|
|---|---|---|---|
|1|w|w|w/elempack|
|2|h|w, h|w, h/elempack|
|3|c|w, h, c|w, h, c/elempack|
If the packing axis dim is not evenly divisible by elempack, zero padding may be used.
```
outw = (w + elempack - 1) / elempack;
```
The following snippet shows the memory layout after elempack=4 on 3-dim Mat
```
// w=2 h=3 c=4 elempack=1
0 1
2 3
4 5
6 7
8 9
10 11
12 13
14 15
16 17
18 19
20 21
22 23
// w=2 h=3 c=1 elempack=4
(0,6,12,18) (1,7,13,19)
(2,8,14,20) (3,9,15,21)
(4,10,16,22) (5,11,17,23)
```
### how to convert elempack
There is a convenient wrapper function provided
```
// convert to elempack 4 if packing axis dim is evenly divisible by elempack
// return the identity Mat otherwise
ncnn::Mat a;
ncnn::Mat a_packed;
ncnn::convert_packing(a, a_packed, 4);
if (a_packed.elempack == 4)
{
// check if packing is successful
}
// convert to packing 1, aka unpacking, shall be always successful
ncnn::Mat b;
ncnn::Mat b_unpacked;
ncnn::convert_packing(b, b_unpacked, 1);
```
### handle general interleaved data
Here is an example of using convert packing to convert RGB interleaved data to planar
**NOTE:** The following code is just presented to explain what packing is and the conversion process. Do not use it in production due to its poor performance. Do use ncnn::Mat::from_pixels()
```cpp
// rgb_interleaved_u8 is RGB RGB RGB ...
// rgb_interleaved_u8.w = w;
// rgb_interleaved_u8.h = h;
// rgb_interleaved_u8.c = 1;
// rgb_interleaved_u8.elemsize = 3;
// rgb_interleaved_u8.elempack = 3;
ncnn::Mat rgb_interleaved_u8(w, h, 1, 3, 3);
ncnn::Mat rgb_planar_u8;
ncnn::convert_packing(rgb_interleaved_u8, rgb_planar_u8, 1);
// rgb_planar_u8 is now RRR ... GGG ... BBB ...
// rgb_planar_u8.w = w;
// rgb_planar_u8.h = h;
// rgb_planar_u8.c = 3;
// rgb_planar_u8.elemsize = 1;
// rgb_planar_u8.elempack = 1;
```

View File

@ -0,0 +1,75 @@
### How to submit code
#### 1. Fork the repository
Open [ncnn](https://github.com/tencent/ncnn) in a browser and `fork` it into your own repositories, e.g.
```
https://github.com/user/ncnn
```
Clone the project locally, add the official remote, and fetch:
```
$ git clone https://github.com/user/ncnn && cd ncnn
$ git remote add tencent https://github.com/tencent/ncnn
$ git fetch tencent
```
The cloned project now has two remotes, origin and tencent:
```
$ git remote -v
origin https://github.com/user/ncnn (fetch)
origin https://github.com/user/ncnn (push)
tencent https://github.com/Tencent/ncnn (fetch)
tencent https://github.com/Tencent/ncnn (push)
```
origin points to your forked repository, while tencent is the official repo. Branches can be created and pushed based on either remote.
For example, check out the official master branch and create your own branch from it (keep the name short and descriptive; one branch should do only one thing, which makes review and revert easier):
```
$ git checkout tencent/master
$ git checkout -b add-conv-int8
```
Or specify the official master branch as the base when creating the branch:
```
$ git checkout -b fix-typo-in-document tencent/master
```
> `git fetch` pulls the latest code from the remote into your local repo. For a second PR to ncnn you can start directly from `git fetch tencent`; there is no need to run `git remote add tencent` again, nor to touch `github.com/user/ncnn`.
#### 2. Coding conventions
To make communication more efficient, reviewers generally ask contributors to follow these rules:
* put a line break between `if-else` and the opening brace `{`
* do not add or remove blank lines arbitrarily
* replace tabs with 4 spaces
* for platform compatibility, `c++11` is not used for now, and `template` should be avoided as much as possible under the `src` directory
* when adding a new feature or platform, corresponding test cases are required under the `test` directory
* put documents under the matching `doc` directory; Chinese docs use the `.zh.md` suffix, English docs use the plain `.md` suffix
After development is done, commit and push to your own repository:
```
$ git commit -a
$ git push origin add-conv-int8
```
Tools such as [`commitizen`](https://pypi.org/project/commitizen/) or [`gitlint`](https://jorisroovers.com/gitlint/) are recommended for formatting commit messages, which makes it easier to search a large commit history later.
#### 3. Submitting the pull request
Open [ncnn pulls](https://github.com/Tencent/ncnn/pulls) in the browser; a prompt for this branch should appear. Click `Compare & pull request`.
* the title **must** be in English; an unfinished branch should start with `WIP:`, e.g. `WIP: add conv int8`
* the body should cover the following, in Chinese or English:
* an overview of the change and how it is implemented
* functional or performance tests
* test results
CI has automatic formatting integrated: restyled-io generates a `Restyled add conv int8` branch alongside the PR, and you need to merge this auto-restyled branch, e.g.
```
$ git fetch tencent
$ git checkout add-conv-int8
$ git merge tencent/restyled/pull-2078
$ git push origin add-conv-int8
```
Go back to the browser and sign the CLA. After all CI checks pass, ask a reviewer to merge the branch.
#### 4. Easter egg
Leaving your personal QQ number may trigger a hidden event.

View File

@ -0,0 +1,323 @@
# step1 create a new empty class
```cpp
// mylayer.h
#include "layer.h"
using namespace ncnn;
// a new layer type called MyLayer
class MyLayer : public Layer
{
};
// mylayer.cpp
#include "mylayer.h"
DEFINE_LAYER_CREATOR(MyLayer)
```
# step2 declare layer parameters and weights
```cpp
// mylayer.h
#include "layer.h"
using namespace ncnn;
class MyLayer : public Layer
{
private:
int channels;// new code
float gamma;// new code
Mat weight;// new code
};
// mylayer.cpp
#include "mylayer.h"
DEFINE_LAYER_CREATOR(MyLayer)
```
# step3 implement load functions for parameters and weights
```cpp
// mylayer.h
#include "layer.h"
using namespace ncnn;
class MyLayer : public Layer
{
public:
virtual int load_param(const ParamDict& pd);// new code
virtual int load_model(const ModelBin& mb);// new code
private:
int channels;
float eps;
Mat gamma_data;
};
// mylayer.cpp
#include "mylayer.h"
DEFINE_LAYER_CREATOR(MyLayer)
// new routine for loading parameters
int MyLayer::load_param(const ParamDict& pd)
{
// details about the relations with param file
// https://github.com/Tencent/ncnn/wiki/param-and-model-file-structure
//
channels = pd.get(0, 0);// parse 0=<int value> entry, default value 0
eps = pd.get(1, 0.001f);// parse 1=<float value> entry, default value 0.001f
return 0;// return zero if success
}
// new routine for loading weights
int MyLayer::load_model(const ModelBin& mb)
{
// details about the relations with model file
// https://github.com/Tencent/ncnn/wiki/param-and-model-file-structure
//
// read weights with length of channels * sizeof(float)
// the second argument explains as follows
// 0 judge the value type automatically, you may get float or float16 or uint8 etc
// depends on the model storage and the supporting target hardware
// 1 read float values anyway
// 2 read float16 values anyway
// 3 read uint8 values anyway
gamma_data = mb.load(channels, 1);
if (gamma_data.empty())
return -100;// return non-zero on error, -100 indicates out-of-memory
return 0;// return zero if success
}
```
# step4 determine forward behavior
```cpp
// mylayer.h
#include "layer.h"
using namespace ncnn;
class MyLayer : public Layer
{
public:
MyLayer();// new code
virtual int load_param(const ParamDict& pd);
virtual int load_model(const ModelBin& mb);
private:
int channels;
float eps;
Mat gamma_data;
};
// mylayer.cpp
#include "mylayer.h"
DEFINE_LAYER_CREATOR(MyLayer)
// new routine for setting forward behavior
MyLayer::MyLayer()
{
// one input and one output
// typical one_blob_only type: Convolution, Pooling, ReLU, Softmax ...
// typical non-one_blob_only type: Eltwise, Split, Concat, Slice ...
one_blob_only = true;
// do not change the blob size, modify data in-place
// typical support_inplace type: ReLU, Sigmoid ...
// typical non-support_inplace type: Convolution, Pooling ...
support_inplace = true;
}
int MyLayer::load_param(const ParamDict& pd)
{
channels = pd.get(0, 0);
eps = pd.get(1, 0.001f);
// you could alter the behavior based on loaded parameter
// if (eps == 0.001f)
// {
// one_blob_only = false;
// support_inplace = false;
// }
return 0;
}
int MyLayer::load_model(const ModelBin& mb)
{
gamma_data = mb.load(channels, 1);
if (gamma_data.empty())
return -100;
// you could alter the behavior based on loaded weight
// if (gamma_data[0] == 0.f)
// {
// one_blob_only = false;
// support_inplace = false;
// }
return 0;
}
```
# step5 choose proper interface based on forward behavior
```cpp
// The base class Layer defines four interfaces for each forward behavior combination
// 1
virtual int forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const;
// 2
virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;
// 3
virtual int forward_inplace(std::vector<Mat>& bottom_top_blobs, const Option& opt) const;
// 4
virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;
```
**must** = layer must implement this function
**optional** = layer may implement this function for optimal performance
Sometimes the graph inference path cannot call forward_inplace directly because of data sharing. In this situation the non-inplace forward routine is used instead; if the optional routine is not implemented, it deep-copies the input blob and calls the inplace forward on the copy. Thus, you can avoid this deep copy by implementing the optional routine, processing input to output on the fly.
|one_blob_only|support_inplace|1|2|3|4|
|---|---|---|---|---|---|
|false|false|must| | | |
|false|true|optional| |must| |
|true|false| |must| | |
|true|true| |optional| |must|
# step6 implement forward function
```cpp
// mylayer.h
#include "layer.h"
using namespace ncnn;
class MyLayer : public Layer
{
public:
MyLayer();
virtual int load_param(const ParamDict& pd);
virtual int load_model(const ModelBin& mb);
virtual int forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const;// new code, optional
virtual int forward_inplace(Mat& bottom_top_blob, const Option& opt) const;// new code
private:
int channels;
float eps;
Mat gamma_data;
};
// mylayer.cpp
#include "mylayer.h"
DEFINE_LAYER_CREATOR(MyLayer)
MyLayer::MyLayer()
{
one_blob_only = true;
support_inplace = true;
}
int MyLayer::load_param(const ParamDict& pd)
{
channels = pd.get(0, 0);
eps = pd.get(1, 0.001f);
return 0;
}
int MyLayer::load_model(const ModelBin& mb)
{
gamma_data = mb.load(channels, 1);
if (gamma_data.empty())
return -100;
return 0;
}
// optional new routine for layer forward function, non-inplace version
int MyLayer::forward(const Mat& bottom_blob, Mat& top_blob, const Option& opt) const
{
// check input dims, return non-zero on error
if (bottom_blob.c != channels)
return -1;
// x = (x + eps) * gamma_per_channel
int w = bottom_blob.w;
int h = bottom_blob.h;
size_t elemsize = bottom_blob.elemsize;
int size = w * h;
top_blob.create(w, h, channels, elemsize, opt.blob_allocator);
if (top_blob.empty())
return -100;// return non-zero on error, -100 indicates out-of-memory
#pragma omp parallel for num_threads(opt.num_threads)
for (int q=0; q<channels; q++)
{
const float* ptr = bottom_blob.channel(q);
float* outptr = top_blob.channel(q);
const float gamma = gamma_data[q];
for (int i=0; i<size; i++)
{
outptr[i] = (ptr[i] + eps) * gamma ;
}
}
return 0;
}
// new routine for layer forward function
int MyLayer::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
// check input dims, return non-zero on error
if (bottom_top_blob.c != channels)
return -1;
// x = (x + eps) * gamma_per_channel
int w = bottom_top_blob.w;
int h = bottom_top_blob.h;
int size = w * h;
#pragma omp parallel for num_threads(opt.num_threads)
for (int q=0; q<channels; q++)
{
float* ptr = bottom_top_blob.channel(q);
const float gamma = gamma_data[q];
for (int i=0; i<size; i++)
{
ptr[i] = (ptr[i] + eps) * gamma ;
}
}
return 0;
}
```
# step7 integrate with ncnn library
You will probably need to modify caffe2ncnn or mxnet2ncnn etc. to write your layer-specific parameters and weights into the ncnn param and model files.
the param and model file structure [param-and-model-file-structure](param-and-model-file-structure)
```
// example param file content
Input input 0 1 input
Convolution conv2d 1 1 input conv2d 0=32 1=1 2=1 3=1 4=0 5=0 6=768
MyLayer mylayer 1 1 conv2d mylayer0
Pooling maxpool 1 1 mylayer0 maxpool 0=0 1=3 2=2 3=-233 4=0
```
```cpp
ncnn::Net net;
// register custom layer before load param and model
// the layer creator function signature is always XYZ_layer_creator, which defined in DEFINE_LAYER_CREATOR macro
net.register_custom_layer("MyLayer", MyLayer_layer_creator);
net.load_param("model.param");
net.load_model("model.bin");
```

View File

@ -0,0 +1,38 @@
# benchmark
op
# naive C with openmp
for for for
# unroll, first try
h
# register allocation
kernels
# unroll, second try
simd
# neon intrinsics
optional
# naive neon assembly with pld
asm
# pipeline optimize, first try
more register load mla
# pipeline optimize, second try
interleave load mla
# pipeline optimize, third try
loop tail
# usual practice, load/save
233
# usual practice, unroll
233
# usual practice, save register
233

View File

@ -0,0 +1,311 @@
# implement elementwise addition with/without broadcast using BinaryOp operation
* input must be fp32 storage without packing
* output is expected to be fp32 storage without packing
```cpp
void binary_add(const ncnn::Mat& a, const ncnn::Mat& b, ncnn::Mat& c)
{
ncnn::Option opt;
opt.num_threads = 2;
opt.use_fp16_storage = false;
opt.use_packing_layout = false;
ncnn::Layer* op = ncnn::create_layer("BinaryOp");
// set param
ncnn::ParamDict pd;
pd.set(0, 0);// op_type
op->load_param(pd);
op->create_pipeline(opt);
// forward
std::vector<ncnn::Mat> bottoms(2);
bottoms[0] = a;
bottoms[1] = b;
std::vector<ncnn::Mat> tops(1);
op->forward(bottoms, tops, opt);
c = tops[0];
op->destroy_pipeline(opt);
delete op;
}
```
# implement 3x3 box blur on three channel image using ConvolutionDepthWise operation
* input must be fp32 storage without packing
* output is expected to be fp32 storage without packing
```cpp
void convolution_3x3_boxblur_RGB(const ncnn::Mat& rgb, ncnn::Mat& out)
{
ncnn::Option opt;
opt.num_threads = 2;
opt.use_fp16_storage = false;
opt.use_packing_layout = false;
ncnn::Layer* op = ncnn::create_layer("ConvolutionDepthWise");
// set param
ncnn::ParamDict pd;
pd.set(0, 3);// num_output
pd.set(1, 3);// kernel_w
pd.set(5, 0);// bias_term
pd.set(6, 3*3*3);// weight_data_size
pd.set(7, 3);// group
op->load_param(pd);
// set weights
ncnn::Mat weights[1];
weights[0].create(3*3*3);// weight_data
for (int i=0; i<3*3*3; i++)
{
weights[0][i] = 1.f / 9;
}
op->load_model(ncnn::ModelBinFromMatArray(weights));
op->create_pipeline(opt);
// forward
op->forward(rgb, out, opt);
op->destroy_pipeline(opt);
delete op;
}
```
# transpose Mat, chw to cwh
* input must be fp32 storage with/without packing
* output is expected to be fp32 storage packed
```cpp
void transpose(const ncnn::Mat& in, ncnn::Mat& out)
{
ncnn::Option opt;
opt.num_threads = 2;
opt.use_fp16_storage = false;
opt.use_packing_layout = true;
ncnn::Layer* op = ncnn::create_layer("Permute");
// set param
ncnn::ParamDict pd;
pd.set(0, 1);// order_type
op->load_param(pd);
op->create_pipeline(opt);
ncnn::Mat in_packed = in;
{
// resolve dst_elempack
int dims = in.dims;
int elemcount = 0;
if (dims == 1) elemcount = in.elempack * in.w;
if (dims == 2) elemcount = in.elempack * in.h;
if (dims == 3) elemcount = in.elempack * in.c;
int dst_elempack = 1;
if (op->support_packing)
{
if (elemcount % 8 == 0 && (ncnn::cpu_support_x86_avx2() || ncnn::cpu_support_x86_avx()))
dst_elempack = 8;
else if (elemcount % 4 == 0)
dst_elempack = 4;
}
if (in.elempack != dst_elempack)
{
convert_packing(in, in_packed, dst_elempack, opt);
}
}
// forward
op->forward(in_packed, out, opt);
op->destroy_pipeline(opt);
delete op;
}
```
# apply instance normalization
// x = (x - mean) / sqrt(var)
* input can be fp32/fp16 storage with/without packing
* output is expected to be fp16 storage packed when supported, or fp32 storage packed otherwise
```cpp
void normalize(const ncnn::Mat& in, ncnn::Mat& out)
{
ncnn::Option opt;
opt.num_threads = 2;
opt.use_fp16_storage = true;
opt.use_packing_layout = true;
ncnn::Layer* op = ncnn::create_layer("InstanceNorm");
// set param
ncnn::ParamDict pd;
pd.set(0, in.c);// channels
pd.set(1, 0.f);// eps
op->load_param(pd);
// set weights
ncnn::Mat weights[2];
weights[0].create(in.c);// gamma_data
weights[1].create(in.c);// beta_data
weights[0].fill(1.f);
weights[1].fill(0.f);
op->load_model(ncnn::ModelBinFromMatArray(weights));
op->create_pipeline(opt);
ncnn::Mat in_fp16 = in;
if (in.elembits() == 32 && op->support_fp16_storage)
{
cast_float32_to_float16(in, in_fp16, opt);
}
if (in.elembits() == 16 && !op->support_fp16_storage)
{
cast_float16_to_float32(in, in_fp16, opt);
}
ncnn::Mat in_fp16_packed = in_fp16;
{
// resolve dst_elempack
int dims = in_fp16.dims;
int elemcount = 0;
if (dims == 1) elemcount = in_fp16.elempack * in_fp16.w;
if (dims == 2) elemcount = in_fp16.elempack * in_fp16.h;
if (dims == 3) elemcount = in_fp16.elempack * in_fp16.c;
int dst_elempack = 1;
if (op->support_packing)
{
if (elemcount % 8 == 0 && (ncnn::cpu_support_x86_avx2() || ncnn::cpu_support_x86_avx()))
dst_elempack = 8;
else if (elemcount % 4 == 0)
dst_elempack = 4;
}
if (in_fp16.elempack != dst_elempack)
{
convert_packing(in_fp16, in_fp16_packed, dst_elempack, opt);
}
}
// forward
op->forward(in_fp16_packed, out, opt);
op->destroy_pipeline(opt);
delete op;
}
```
# cpu -> gpu -> forward -> gpu -> cpu
```cpp
ncnn::VulkanDevice* vkdev = ncnn::get_gpu_device();
ncnn::VkAllocator* blob_vkallocator = vkdev->acquire_blob_allocator();
ncnn::VkAllocator* staging_vkallocator = vkdev->acquire_staging_allocator();
ncnn::VkWeightAllocator* weight_vkallocator = new ncnn::VkWeightAllocator(vkdev);
ncnn::VkWeightStagingAllocator* weight_staging_vkallocator = new ncnn::VkWeightStagingAllocator(vkdev);
// create layer
ncnn::Layer* convolution = ncnn::create_layer("Convolution");
convolution->vkdev = vkdev;
// set option
ncnn::Option opt;
opt.num_threads = 4;
opt.use_vulkan_compute = true;
opt.blob_vkallocator = blob_vkallocator;
opt.workspace_vkallocator = blob_vkallocator;
opt.staging_vkallocator = staging_vkallocator;
// load param
{
ncnn::ParamDict pd;
pd.set(0, outch);
pd.set(1, ksize);
pd.set(6, outch*inch*ksize*ksize);
pd.use_vulkan_compute = 1;
convolution->load_param(pd);
}
// load model
{
ncnn::Mat weights[2];
weights[0] = random_mat(outch*inch*ksize*ksize);
weights[1] = random_mat(outch);
ncnn::ModelBinFromMatArray mb(weights);
convolution->load_model(mb);
}
// create pipeline
convolution->create_pipeline(opt);
// upload model
{
ncnn::VkTransfer cmd(vkdev);
ncnn::Option opt_upload = opt;
opt_upload.blob_vkallocator = weight_vkallocator;
opt_upload.workspace_vkallocator = weight_vkallocator;
opt_upload.staging_vkallocator = weight_staging_vkallocator;
convolution->upload_model(cmd, opt_upload);
cmd.submit_and_wait();
}
ncnn::Mat bottom = random_mat(w, h, inch);
ncnn::Mat top;
// forward
{
ncnn::VkCompute cmd(vkdev);
ncnn::VkMat bottom_gpu;
cmd.record_upload(bottom, bottom_gpu, opt);
ncnn::VkMat top_gpu;
convolution->forward(bottom_gpu, top_gpu, cmd, opt);
cmd.record_download(top_gpu, top, opt);
cmd.submit_and_wait();
}
convolution->destroy_pipeline(opt);
delete convolution;
vkdev->reclaim_blob_allocator(blob_vkallocator);
vkdev->reclaim_staging_allocator(staging_vkallocator);
weight_vkallocator->clear();
weight_staging_vkallocator->clear();
delete weight_vkallocator;
delete weight_staging_vkallocator;
```

View File

@ -0,0 +1,46 @@
### blob memory is implicitly shared
ncnn blobs originally used OpenCV's cv::Mat directly; since a blob needs at most three dimensions, a similar Mat class was implemented instead
The data of each Mat channel is 16-byte aligned and carries an atomic reference count, so a=b copies no data and is extremely fast
Mat can also reference an external memory block directly without copying, which speeds up model loading and input/output
For example, the Split layer copies one blob into n blobs; in ncnn this is implemented as a simple reference-count increase, with no data copy at all
### compute only what is needed and keep intermediate results
ncnn's net resolves branch dependencies top-down and depth-first, so when the network has multiple branches, computation only happens in the branch whose result is requested, saving time
When several branches overlap, computing one branch automatically keeps the intermediate results the other branches need, implicitly shared, so they can be reused when those branches are computed
For example, given a network A -> B -> C1 + C2: asking ncnn for the C1 result runs A -> B -> C1 and bumps the reference count of B's result so it is kept automatically; when the C2 result is needed later, computing only C2 is enough
### enable light mode to save memory
Every layer produces blobs; except for the final result and multi-branch intermediate results, most blobs are not worth keeping, and enabling light mode reclaims them automatically after use, saving memory
For example, given a network A -> B -> C, in light mode asking ncnn for the C result reclaims A's result automatically while computing B and B's result while computing C, keeping only the C result at the end; asking for C again returns it directly, which fits how the vast majority of deep networks are used
### the network and the computation are separated
ncnn's Net is the network model; what you actually run is an Extractor, so one Net can back many computation instances that do not affect each other, with intermediate results kept inside each Extractor. In multithreaded use they share the network structure and parameter data, so the model and parameters only need to be initialized once
For example, a global static Net instance, once initialized, can keep producing Extractors for use
### openmp is fast but not always appropriate
Almost all computation in ncnn can be accelerated with openmp multithreading, and the performance is great
However, the system sometimes slows down briefly, e.g. the phone overheats and throttles, or the UI is being operated, so ncnn's run time occasionally jitters upward; when run-time stability matters it is recommended to disable openmp or set the Extractor thread count
For example, when using ncnn for real-time face localization during a selfie, a sudden spike in run time is felt as a dropped frame, whereas a stable frame rate feels better
### NCNN_STDIO/NCNN_STRING to disable model file loading
ncnn can load models from its own files as well as from memory. NCNN_STDIO controls whether loading model files is supported; setting it to 0 disables that code and shrinks the library. Setting NCNN_STRING to 0 removes most visible strings and the text parsing code
When loading a model from memory, the parameter data is referenced directly, which is faster; this is the usual approach on phones
### trim the built-in layer implementations of ncnn
Passing -DWITH_LAYER_xxx=OFF to cmake skips compiling the corresponding built-in layer entirely, further shrinking the library
### about ARM big.LITTLE scheduling
Calling set_cpu_powersave pins the ncnn computation threads to particular cpu cores: big cores are fast but power-hungry, little cores are slower but power-efficient, and using both together heats the phone up quickly (see the sketch below)
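A minimal sketch tying these points together (the param/model/blob names are placeholders; the meaning of the set_cpu_powersave values in the comment is my reading of the API and should be treated as an assumption):
```cpp
#include "cpu.h"
#include "net.h"

int run(const ncnn::Mat& in, ncnn::Mat& out)
{
    static ncnn::Net net; // global/static net: initialize once, reuse for every inference
    static bool loaded = false;
    if (!loaded)
    {
        net.load_param("model.param"); // placeholder file names
        net.load_model("model.bin");
        loaded = true;
    }

    ncnn::set_cpu_powersave(2); // assumption: 0 = all cores, 1 = little cores only, 2 = big cores only

    ncnn::Extractor ex = net.create_extractor();
    ex.set_light_mode(true); // reclaim intermediate blobs automatically
    ex.set_num_threads(2);   // cap openmp threads for more stable latency
    ex.input("data", in);
    return ex.extract("output", out);
}
```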

View File

@ -0,0 +1,194 @@
## current model load api
### Cons
#### long and awful code
#### two functions
#### deal float32 float16 quantized-u8
#### deal alignment size
```cpp
#if NCNN_STDIO
int Convolution::load_model(FILE* binfp)
{
int nread;
union
{
struct
{
unsigned char f0;
unsigned char f1;
unsigned char f2;
unsigned char f3;
};
unsigned int tag;
} flag_struct;
nread = fread(&flag_struct, sizeof(flag_struct), 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read flag_struct failed %d\n", nread);
return -1;
}
unsigned int flag = flag_struct.f0 + flag_struct.f1 + flag_struct.f2 + flag_struct.f3;
weight_data.create(weight_data_size);
if (weight_data.empty())
return -100;
if (flag_struct.tag == 0x01306B47)
{
// half-precision weight data
int align_weight_data_size = alignSize(weight_data_size * sizeof(unsigned short), 4);
std::vector<unsigned short> float16_weights;
float16_weights.resize(align_weight_data_size);
nread = fread(float16_weights.data(), align_weight_data_size, 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read float16_weights failed %d\n", nread);
return -1;
}
weight_data = Mat::from_float16(float16_weights.data(), weight_data_size);
if (weight_data.empty())
return -100;
}
else if (flag != 0)
{
// quantized weight data
float quantization_value[256];
nread = fread(quantization_value, 256 * sizeof(float), 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read quantization_value failed %d\n", nread);
return -1;
}
int align_weight_data_size = alignSize(weight_data_size * sizeof(unsigned char), 4);
std::vector<unsigned char> index_array;
index_array.resize(align_weight_data_size);
nread = fread(index_array.data(), align_weight_data_size, 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read index_array failed %d\n", nread);
return -1;
}
float* weight_data_ptr = weight_data;
for (int i = 0; i < weight_data_size; i++)
{
weight_data_ptr[i] = quantization_value[ index_array[i] ];
}
}
else if (flag_struct.f0 == 0)
{
// raw weight data
nread = fread(weight_data, weight_data_size * sizeof(float), 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read weight_data failed %d\n", nread);
return -1;
}
}
if (bias_term)
{
bias_data.create(num_output);
if (bias_data.empty())
return -100;
nread = fread(bias_data, num_output * sizeof(float), 1, binfp);
if (nread != 1)
{
fprintf(stderr, "Convolution read bias_data failed %d\n", nread);
return -1;
}
}
return 0;
}
#endif // NCNN_STDIO
int Convolution::load_model(const unsigned char*& mem)
{
union
{
struct
{
unsigned char f0;
unsigned char f1;
unsigned char f2;
unsigned char f3;
};
unsigned int tag;
} flag_struct;
memcpy(&flag_struct, mem, sizeof(flag_struct));
mem += sizeof(flag_struct);
unsigned int flag = flag_struct.f0 + flag_struct.f1 + flag_struct.f2 + flag_struct.f3;
if (flag_struct.tag == 0x01306B47)
{
// half-precision weight data
weight_data = Mat::from_float16((unsigned short*)mem, weight_data_size);
mem += alignSize(weight_data_size * sizeof(unsigned short), 4);
if (weight_data.empty())
return -100;
}
else if (flag != 0)
{
// quantized weight data
const float* quantization_value = (const float*)mem;
mem += 256 * sizeof(float);
const unsigned char* index_array = (const unsigned char*)mem;
mem += alignSize(weight_data_size * sizeof(unsigned char), 4);
weight_data.create(weight_data_size);
if (weight_data.empty())
return -100;
float* weight_data_ptr = weight_data;
for (int i = 0; i < weight_data_size; i++)
{
weight_data_ptr[i] = quantization_value[ index_array[i] ];
}
}
else if (flag_struct.f0 == 0)
{
// raw weight data
weight_data = Mat(weight_data_size, (float*)mem);
mem += weight_data_size * sizeof(float);
}
if (bias_term)
{
bias_data = Mat(num_output, (float*)mem);
mem += num_output * sizeof(float);
}
return 0;
}
```
## new model load api proposed
### Pros
#### clean and simple api
#### element type detection
```cpp
int Convolution::load_model(const ModelBin& mb)
{
// auto detect element type
weight_data = mb.load(weight_data_size, 0);
if (weight_data.empty())
return -100;
if (bias_term)
{
// certain type specified
bias_data = mb.load(num_output, 1);
if (bias_data.empty())
return -100;
}
return 0;
}
```

View File

@ -0,0 +1,92 @@
## current param load api
### Cons
#### long and awful code
#### three functions
#### not extensible
#### no default value
#### no variable length array
```
MyLayer mylayer 1 1 in out 100 1.250000
```
```
binary 100
binary 1.250000
```
```cpp
#if NCNN_STDIO
#if NCNN_STRING
int MyLayer::load_param(FILE* paramfp)
{
int nscan = fscanf(paramfp, "%d %f", &a, &b);
if (nscan != 2)
{
fprintf(stderr, "MyLayer load_param failed %d\n", nscan);
return -1;
}
return 0;
}
#endif // NCNN_STRING
int MyLayer::load_param_bin(FILE* paramfp)
{
fread(&a, sizeof(int), 1, paramfp);
fread(&b, sizeof(float), 1, paramfp);
return 0;
}
#endif // NCNN_STDIO
int MyLayer::load_param(const unsigned char*& mem)
{
a = *(int*)(mem);
mem += 4;
b = *(float*)(mem);
mem += 4;
return 0;
}
```
## new param load api proposed
### Pros
#### clean and simple api
#### default value
#### extensible
#### variable length array
```
7767517
MyLayer mylayer 1 1 in out 0=100 1=1.250000 -23303=5,0.1,0.2,0.4,0.8,1.0
```
```
binary 0xDD857600(magic)
binary 0
binary 100
binary 1
binary 1.250000
binary -23303
binary 5
binary 0.1
binary 0.2
binary 0.4
binary 0.8
binary 1.0
binary -233(EOP)
```
```cpp
int MyLayer::load_param(const ParamDict& pd)
{
// pd.get( param id (seq), default value );
a = pd.get(0, 100);
b = pd.get(1, 1.25f);
// get default value for c if not specified in param file
c = pd.get(2, 0.001);
// get array
d = pd.get(3, Mat(len, array));
return 0;
}
```

View File

@ -0,0 +1,303 @@
|operation|param id|param phase|default value|weight order|
|:---:|:---:|:---:|:---:|:---:|
|AbsVal|||
|ArgMax|0|out_max_val|0|
||1|topk|1|
|BatchNorm|0|channels|0|slope mean variance bias|
||1|eps|0.f|
|Bias|0|bias_data_size|0|
|BinaryOp|0|op_type|0|
||1|with_scalar|0|
||2|b|0.f|
|BNLL|||
|Cast|0|type_from|0|
||1|type_to|0|
|Clip|0|min|-FLT_MAX|
||1|max|FLT_MAX|
|Concat|0|axis|0|
|Convolution|0|num_output|0|weight bias|
||1|kernel_w|0|
||2|dilation_w|1|
||3|stride_w|1|
||4|pad_left|0|
||5|bias_term|0|
||6|weight_data_size|0|
||8|int8_scale_term|0|
||9|activation_type|0|
||10|activation_params|[ ]|
||11|kernel_h|kernel_w|
||12|dilation_h|dilation_w|
||13|stride_h|stride_w|
||15|pad_right|pad_left|
||14|pad_top|pad_left|
||16|pad_bottom|pad_top|
||17|impl_type|0|
||18|pad_value|0.f|
|ConvolutionDepthWise|0|num_output|0|weight bias|
||1|kernel_w|0|
||2|dilation_w|1|
||3|stride_w|1|
||4|pad_left|0|
||5|bias_term|0|
||6|weight_data_size|0|
||7|group|1|
||8|int8_scale_term|0|
||9|activation_type|0|
||10|activation_params|[ ]|
||11|kernel_h|kernel_w|
||12|dilation_h|dilation_w|
||13|stride_h|stride_w|
||15|pad_right|pad_left|
||14|pad_top|pad_left|
||16|pad_bottom|pad_top|
||18|pad_value|0.f|
|Crop|0|woffset|0|
||1|hoffset|0|
||2|coffset|0|
||3|outw|0|
||4|outh|0|
||5|outc|0|
||6|woffset2|0|
||7|hoffset2|0|
||8|coffset2|0|
||9|starts|[ ]|
||10|ends|[ ]|
||11|axes|[ ]|
|Deconvolution|0|num_output|0|weight bias|
||1|kernel_w|0|
||2|dilation_w|1|
||3|stride_w|1|
||4|pad_left|0|
||5|bias_term|0|
||6|weight_data_size|0|
||9|activation_type|0|
||10|activation_params|[ ]|
||11|kernel_h|kernel_w|
||12|dilation_h|dilation_w|
||13|stride_h|stride_w|
||15|pad_right|pad_left|
||14|pad_top|pad_left|
||16|pad_bottom|pad_top|
||18|output_pad_right|0|
||19|output_pad_bottom|output_pad_right|
||20|output_w|0|
||21|output_h|output_w|
|DeconvolutionDepthWise|0|num_output|0|weight bias|
||1|kernel_w|0|
||2|dilation_w|1|
||3|stride_w|1|
||4|pad_left|0|
||5|bias_term|0|
||6|weight_data_size|0|
||7|group|1|
||9|activation_type|0|
||10|activation_params|[ ]|
||11|kernel_h|kernel_w|
||12|dilation_h|dilation_w|
||13|stride_h|stride_w|
||15|pad_right|pad_left|
||14|pad_top|pad_left|
||16|pad_bottom|pad_top|
||18|output_pad_right|0|
||19|output_pad_bottom|output_pad_right|
||20|output_w|0|
||21|output_h|output_w|
|Dequantize|0|scale|1.f|bias|
||1|bias_term|0|
||2|bias_data_size|0|
|DetectionOutput|0|num_class|0|
||1|nms_threshold|0.05f|
||2|nms_top_k|300|
||3|keep_top_k|100|
||4|confidence_threshold|0.5f|
||5|variances[0]|0.1f|
||6|variances[1]|0.1f|
||7|variances[2]|0.2f|
||8|variances[3]|0.2f|
|Dropout|0|scale|1.f|
|Eltwise|0|op_type|0|
||1|coeffs|[ ]|
|ELU|0|alpha|0.1f|
|Embed|0|num_output|0|weight bias|
||1|input_dim|0|
||2|bias_term|0|
||3|weight_data_size|0|
|Exp|0|base|-1.f|
||1|scale|1.f|
||2|shift|0.f|
|ExpandDims|0|expand_w|0|
||1|expand_h|0|
||2|expand_c|0|
||3|axes|[ ]|
|Flatten|||
|HardSigmoid|0|alpha|0.2f||
||1|beta|0.5f|
|HardSwish|0|alpha|0.2f||
||1|beta|0.5f|
|InnerProduct|0|num_output|0|weight bias|
||1|bias_term|0|
||2|weight_data_size|0|
||8|int8_scale_term|0|
||9|activation_type|0|
||10|activation_params|[ ]|
|Input|0|w|0|
||1|h|0|
||2|c|0|
|InstanceNorm|0|channels|0|gamma bias|
||1|eps|0.001f|
|Interp|0|resize_type|0|
||1|height_scale|1.f|
||2|width_scale|1.f|
||3|output_height|0|
||4|output_width|0|
|Log|0|base|-1.f|
||1|scale|1.f|
||2|shift|0.f|
|LRN|0|region_type|0|
||1|local_size|5|
||2|alpha|1.f|
||3|beta|0.75f|
||4|bias|1.f|
|LSTM|0|num_output|0|
||1|weight_data_size|1|
||2|direction|0|
|MemoryData|0|w|0|
||1|h|0|
||2|c|0|
|Mish|||
|MVN|0|normalize_variance|0|
||1|across_channels|0|
||2|eps|0.0001f|
|Noop|||
|Normalize|0|across_spatial|0|scale|
||4|across_channel|0|
||1|channel_shared|0|
||2|eps|0.0001f|
||9|eps_mode|0|
||3|scale_data_size|0|
|Packing|0|out_packing|1|
||1|use_padding|0|
||2|cast_type_from|0|
||3|cast_type_to|0|
||4|storage_type_from|0|
||5|storage_type_to|0|
|Padding|0|top|0|per_channel_pad_data|
||1|bottom|0|
||2|left|0|
||3|right|0|
||4|type|0|
||5|value|0.f|
||6|per_channel_pad_data_size|0|
||7|front|0|
||8|behind|0|
|Permute|0|order_type|0|
|PixelShuffle|0|upscale_factor|1|
|Pooling|0|pooling_type(0: max 1: avg)|0|
||1|kernel_w|0|
||11|kernel_h|kernel_w|
||2|stride_w|1|
||12|stride_h|stride_w|
||3|pad_left|0|
||14|pad_right|pad_left|
||13|pad_top|pad_left|
||15|pad_bottom|pad_top|
||4|global_pooling|0|
||5|pad_mode|0|
|Power|0|power|1.f|
||1|scale|1.f|
||2|shift|0.f|
|PReLU|0|num_slope|0|slope|
|PriorBox|0|min_sizes|[ ]|
||1|max_sizes|[ ]|
||2|aspect_ratios|[ ]|
||3|variances[0]|0.f|
||4|variances[1]|0.f|
||5|variances[2]|0.f|
||6|variances[3]|0.f|
||7|flip|1|
||8|clip|0|
||9|image_width|0|
||10|image_height|0|
||11|step_width|-233.f|
||12|step_height|-233.f|
||13|offset|0.f|
||14|step_mmdetection|0|
||15|center_mmdetection|0|
|Proposal|0|feat_stride|16|
||1|base_size|16|
||2|pre_nms_topN|6000|
||3|after_nms_topN|300|
||4|num_thresh|0.7f|
||5|min_size|16|
|PSROIPooling|0|pooled_width|7|
||1|pooled_height|7|
||2|spatial_scale|0.0625f|
||3|output_dim|0|
|Quantize|0|scale|1.f|
|Reduction|0|operation|0|
||1|dim|0|
||2|coeff|1.f|
||3|axes|[ ]|
||4|keepdims|0|
|ReLU|0|slope|0.f|
|Reorg|0|stride|0|
|Requantize|0|scale_in|1.f|bias|
||1|scale_out|1.f|
||2|bias_term|0|
||3|bias_data_size|0|
||4|fusion_relu|0|
|Reshape|0|w|-233|
||1|h|-233|
||2|c|-233|
||3|permute|0|
|ROIAlign|0|pooled_width|0|
||1|pooled_height|0|
||2|spatial_scale|1.f|
||3|sampling_ratio|0|
||4|aligned|0|
||5|version|0|
|ROIPooling|0|pooled_width|0|
||1|pooled_height|0|
||2|spatial_scale|1.f|
|Scale|0|scale_data_size|0|scale bias|
||1|bias_term|0|
|SELU|0|alpha|1.67326324f||
||1|lambda|1.050700987f|
|ShuffleChannel|0|group|1|
|Sigmoid|||
|Slice|0|slices|[ ]|
||1|axis|0|
|Softmax|0|axis|0|
|Split|||
|SPP|0|pooling_type|0|
||1|pyramid_height|1|
|Squeeze|0|squeeze_w|0|
||1|squeeze_h|0|
||2|squeeze_c|0|
||3|axes|[ ]|
|StatisticsPooling|0|include_stddev|0|
|Swish|||
|TanH|||
|Threshold|0|threshold|0.f|
|Tile|0|dim|0|
||1|tiles|1|
|UnaryOp|0|op_type|0|
|YoloDetectionOutput|0|num_class|20|
||1|num_box|5|
||2|confidence_threshold|0.01f|
||3|num_threshold|0.45f|
||4|biases|[]|
|Yolov3DetectionOutput|0|num_class|20|
||1|num_box|5|
||2|confidence_threshold|0.01f|
||3|num_threshold|0.45f|
||4|biases|[]|
||5|mask|[]|
||6|anchors_scale|[]|
|RNN|0|num_output|0|
||1|weight_data_size|0|
||2|direction|0|
|MultiHeadAttention|0|embed_dim|0|
||1|num_head|1|
||2|weight_data_size|0|

File diff suppressed because it is too large

View File

@ -0,0 +1,64 @@
## net.param
### example
```
7767517
3 3
Input input 0 1 data 0=4 1=4 2=1
InnerProduct ip 1 1 data fc 0=10 1=1 2=80
Softmax softmax 1 1 fc prob 0=0
```
### overview
```
[magic]
```
* magic number : 7767517
```
[layer count] [blob count]
```
* layer count : count of the layer lines that follow; it should be exactly the number of layers
* blob count : count of all blobs, usually greater than or equal to the layer count
### layer line
```
[layer type] [layer name] [input count] [output count] [input blobs] [output blobs] [layer specific params]
```
* layer type : type name, such as Convolution Softmax etc
* layer name : name of this layer, must be unique among all layer names
* input count : count of the blobs this layer needs as input
* output count : count of the blobs this layer produces as output
* input blobs : name list of all the input blob names, separated by space, must be unique among input blob names of all layers
* output blobs : name list of all the output blob names, separated by space, must be unique among output blob names of all layers
* layer specific params : key=value pair list, separated by space
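As a worked example, the line `InnerProduct ip 1 1 data fc 0=10 1=1 2=80` above declares a layer of type InnerProduct named `ip` with one input blob `data` and one output blob `fc`; its params `0=10 1=1 2=80` correspond to num_output, bias_term and weight_data_size in the operation table.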
### layer param
```
0=1 1=2.5 -23303=2,2.0,3.0
```
key indexes should be unique within each layer line; a pair can be omitted if the default value is used
the meaning of each existing param key index can be looked up at [operation-param-weight-table](operation-param-weight-table)
* integer or float key : index 0 ~ 19
* integer value : int
* float value : float
* integer array or float array key : -23300 minus index 0 ~ 19 (e.g. an array param at index 3 uses key -23303)
* integer array value : [array size],int,int,...,int
* float array value : [array size],float,float,...,float
## net.bin
```
+---------+---------+---------+---------+---------+---------+
| weight1 | weight2 | weight3 | weight4 | ....... | weightN |
+---------+---------+---------+---------+---------+---------+
^ ^ ^ ^
0x0 0x80 0x140 0x1C0
```
the model binary is the concatenation of all weight data; each weight buffer is aligned to 32 bits
### weight buffer
```
[flag] (optional)
[raw data]
[padding] (optional)
```
* flag : unsigned int, little-endian, indicating the weight storage type, 0 => float32, 0x01306B47 => float16, otherwise => quantized int8; may be omitted if the layer implementation forces the storage type explicitly
* raw data : raw weight data, little-endian, float32 data or float16 data or quantized table and indexes depending on the storage type flag
* padding : padding space for 32bit alignment, may be omitted if already aligned
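As a rough illustration of the layout above, the following sketch walks one weight buffer in memory; the element count comes from the layer definition, and the int8 branch is omitted because its layout (quantization table plus indexes) is more involved.
```cpp
#include <cstdint>
#include <cstring>

// Skip over one weight buffer that begins with the optional storage-type flag.
// Layers that force the storage type omit the flag; that case is not handled here.
size_t skip_weight_buffer(const unsigned char* p, size_t element_count)
{
    uint32_t flag;
    std::memcpy(&flag, p, 4);                // little-endian unsigned int
    size_t offset = 4;

    size_t elemsize = 4;                     // 0 => float32
    if (flag == 0x01306B47)
        elemsize = 2;                        // => float16
    // otherwise => quantized int8 (quantization table + indexes, not shown)

    offset += element_count * elemsize;      // raw data
    offset = (offset + 3) & ~size_t(3);      // padding to the next 32-bit boundary
    return offset;                           // bytes consumed by this weight buffer
}
```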


@ -0,0 +1,29 @@
## 只是实践经验,没有理论,不一定正确
```
prfm pldl1keep, [x0, #256]
```
* 放在 ld1 [x0] 前面 0~8 条指令
* #256 表示把 x0+256 的内容放进 L1 cache
* ldp 也适用
* (经验)不写 offset 不如写个 #128
* (经验)pldl1strm 似乎没啥意思,也没 pldl1keep 快
* (经验)x0 ~ x0+256 的内容也会进来
* (经验)load 128bit 用 #128256bit或更多用 #256
* (经验)避免 pld apld bload aload b 顺序,可能相互干扰
* (经验)提前太多会失效
* (经验)适合连续读
```
prfm pldl2strm, [x0, #256]
```
* 放在 ld1 [x0] 前面 N 条指令N 尽量大些
* #256 表示把 x0+256 的内容放进 L2 cache
* ldp 也适用
* (经验)不写 offset 不如写个 #128
* (经验)pldl2strm 效果稍好于 pldl2keep
* (经验)x0 ~ x0+256 的内容也会进来
* (经验)load 128bit 用 #128256bit 用 #256
* (经验)读很多数据,用不同 offset 连续两次 pldl2strm
* (经验)后面不要对同位置再 pldl1keep会变慢
* (经验)适合提前准备要跳到很远的地方读,比如换 channel


@ -0,0 +1,57 @@
## batchnorm
```
Input A 0 1 A 0 0 0
MemoryData sub/y 0 1 sub/y 16 0 0
BinaryOp sub 2 1 A sub/y sub 1
MemoryData div/y 0 1 div/y 16 0 0
BinaryOp div 2 1 sub div/y div 3
MemoryData mul/y 0 1 mul/y 16 0 0
BinaryOp mul 2 1 div mul/y mul 2
MemoryData BiasAdd/bias 0 1 BiasAdd/bias 16 0 0
BinaryOp BiasAdd 2 1 mul BiasAdd/bias BiasAdd 0
```
## convolution
```
Input A 0 1 A 0 0 0
Convolution Conv2D 1 1 A Conv2D 10 3 1 1 0 0 270
MemoryData biases/read 0 1 biases/read 10 0 0
BinaryOp BiasAdd 2 1 Conv2D biases/read BiasAdd 0
```
## innerproduct
```
Input A 0 1 A 0 0 0
MemoryData biases/read 0 1 biases/read 10 0 0
InnerProduct MatMul 1 1 A MatMul 10 0 2560
BinaryOp conv6 2 1 MatMul biases/read conv6 0
```
## leakyrelu
```
Input A 0 1 A 0 0 0
Split splitncnn_0 1 2 A A_splitncnn_0 A_splitncnn_1
MemoryData mul_1/x 0 1 mul_1/x 0 0 0
BinaryOp mul_1 2 1 mul_1/x A_splitncnn_1 mul_1 2
BinaryOp leaky 2 1 mul_1 A_splitncnn_0 leaky 4
```
## prelu
```
Input A 0 1 A 0 0 0
Split splitncnn_0 1 2 A A_splitncnn_0 A_splitncnn_1
MemoryData prelu/alpha 0 1 prelu/alpha 10 0 0
ReLU prelu/Relu 1 1 A_splitncnn_1 prelu/Relu 0.000000
UnaryOp prelu/Neg 1 1 A_splitncnn_0 prelu/Neg 1
ReLU prelu/Relu_1 1 1 prelu/Neg prelu/Relu_1 0.000000
UnaryOp prelu/Neg_1 1 1 prelu/Relu_1 prelu/Neg_1 1
BinaryOp prelu/Mul 2 1 prelu/alpha prelu/Neg_1 prelu/Mul 2
BinaryOp prelu/add 2 1 prelu/Relu prelu/Mul prelu/add 0
```
## softmax
```
Input A 0 1 A 0 0 0
Split splitncnn_4 1 2 A A_splitncnn_0 A_splitncnn_1
Reduction Max 1 1 A_splitncnn_1 Max 4 -2 1.000000
BinaryOp sub 2 1 A_splitncnn_0 Max sub 1
UnaryOp Exp 1 1 sub Exp 7
Split splitncnn_5 1 2 Exp Exp_splitncnn_0 Exp_splitncnn_1
Reduction Sum 1 1 Exp_splitncnn_1 Sum 0 -2 1.000000
BinaryOp prob 2 1 Exp_splitncnn_0 Sum prob 3
```

3rdparty/ncnn/docs/faq.md vendored Normal file

@ -0,0 +1,676 @@
# 如何加入技术交流QQ群
- 打开QQ→点击群聊搜索→搜索群号637093648→输入问题答案卷卷卷卷卷→进入群聊→准备接受图灵测试bushi
- 前往QQ搜索Pocky群677104663(超多大佬)问题答案multi level intermediate representation
# 如何看作者b站直播
- nihui的bilibili直播间[水竹院落](https://live.bilibili.com/1264617)
# 编译
- ## 怎样下载完整源码?
git clone --recursive https://github.com/Tencent/ncnn/
或者
下载 [ncnn-xxxxx-full-source.zip](https://github.com/Tencent/ncnn/releases)
- ## 怎么交叉编译cmake 工具链怎么设置啊?
参见 https://github.com/Tencent/ncnn/wiki/how-to-build
- ## The submodules were not downloaded! Please update submodules with "git submodule update --init" and try again
如上,下载完整源码。或者按提示执行: git submodule update --init
- ## Could NOT find Protobuf (missing: Protobuf_INCLUDE_DIR)
sudo apt-get install libprotobuf-dev protobuf-compiler
- ## Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY)
https://github.com/Tencent/ncnn/issues/1873
- ## Could not find a package configuration file provided by "OpenCV" with any of the following names: OpenCVConfig.cmake opencv-config.cmake
sudo apt-get install libopencv-dev
或者自行编译安装set(OpenCV_DIR {OpenCVConfig.cmake所在目录})
- ## Could not find a package configuration file provided by "ncnn" with any of the following names: ncnnConfig.cmake ncnn-config.cmake
set(ncnn_DIR {ncnnConfig.cmake所在目录})
- ## 找不到 Vulkan,
cmake 版本 >= 3.10,否则没有带 FindVulkan.cmake
android-api >= 24
macos 要先执行安装脚本
- ## 如何安装 vulkan sdk
- ## 找不到库(需要根据系统/编译器指定)
undefined reference to __kmpc_for_static_init_4 __kmpc_for_static_fini __kmpc_fork_call ...
需要链接openmp库
undefined reference to vkEnumerateInstanceExtensionProperties vkGetInstanceProcAddr vkQueueSubmit ...
需要 vulkan-1.lib
undefined reference to glslang::InitializeProcess() glslang::TShader::TShader(EShLanguage) ...
需要 glslang.lib OGLCompiler.lib SPIRV.lib OSDependent.lib
undefined reference to AAssetManager_fromJava AAssetManager_open AAsset_seek ...
find_library 和 target_link_libraries 中增加 android
find_package(ncnn)
- ## undefined reference to typeinfo for ncnn::Layer
opencv rtti -> opencv-mobile
- ## undefined reference to __cpu_model
升级编译器 / libgcc_s libgcc
- ## unrecognized command line option "-mavx2"
升级 gcc
- ## 为啥自己编译的ncnn android库特别大
https://github.com/Tencent/ncnn/wiki/build-for-android.zh 以及见 如何裁剪更小的 ncnn 库
- ## ncnnoptimize和自定义层
先ncnnoptimize再增加自定义层避免ncnnoptimize不能处理自定义层保存。
- ## rtti/exceptions冲突
产生原因是项目工程中使用的库配置不一样导致冲突根据自己的实际情况分析是需要开启还是关闭。ncnn默认是ON在重新编译ncnn时增加以下2个参数即可
- 开启:-DNCNN_DISABLE_RTTI=OFF -DNCNN_DISABLE_EXCEPTION=OFF
- 关闭:-DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON
- ## error: undefined symbol: ncnn::Extractor::extract(char const*, ncnn::Mat&)
可能的情况:
- 尝试升级 Android Studio 的 NDK 版本
# 怎样添加ncnn库到项目中cmake方式怎么用
编译 ncnn,make install。linux/windows 下 set/export ncnn_DIR 指向 install 目录下包含 ncnnConfig.cmake 的目录
- ## android
- ## ios
- ## linux
- ## windows
- ## macos
- ## arm linux
# 转模型问题
- ## caffe
`./caffe2ncnn caffe.prototxt caffe.caffemodel ncnn.param ncnn.bin`
- ## mxnet
` ./mxnet2ncnn mxnet-symbol.json mxnet.params ncnn.param ncnn.bin`
- ## darknet
[https://github.com/xiangweizeng/darknet2ncnn](https://github.com/xiangweizeng/darknet2ncnn)
- ## pytorch - onnx
[use ncnn with pytorch or onnx](https://github.com/Tencent/ncnn/wiki/use-ncnn-with-pytorch-or-onnx)
- ## tensorflow 1.x/2.x - keras
[https://github.com/MarsTechHAN/keras2ncnn](https://github.com/MarsTechHAN/keras2ncnn) **[@MarsTechHAN](https://github.com/MarsTechHAN)**
- ## tensorflow 2.x - mlir
[通过MLIR将tensorflow2模型转换到ncnn](https://zhuanlan.zhihu.com/p/152535430) **@[nihui](https://www.zhihu.com/people/nihui-2)**
- ## Shape not supported yet! Gather not supported yet! Cast not supported yet!
onnx-simplifier 静态shape
- ## convertmodel
[https://convertmodel.com/](https://convertmodel.com/) **[@大老师](https://github.com/daquexian)**
- ## netron
[https://github.com/lutzroeder/netron](https://github.com/lutzroeder/netron)
- ## 怎么生成有固定 shape 信息的模型?
Input 0=w 1=h 2=c
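For example, a fixed 224x224x3 input can be written directly into the Input layer line of the param file (the layer/blob name `data` here is just an illustration):
```
Input            data             0 1 data 0=224 1=224 2=3
```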
- ## why gpu能更快
- ## ncnnoptimize 怎么转成 fp16 模型
`ncnnoptimize model.param model.bin yolov5s-opt.param yolov5s-opt.bin 65536`
- ## ncnnoptimize 怎样查看模型的 FLOPS / 内存占用情况
- ## 怎么修改模型支持动态 shape
Interp Reshape
- ## 如何将模型转换为代码内嵌到程序里?
ncnn2mem
- ## 如何加密模型?
https://zhuanlan.zhihu.com/p/268327784
- ## Linux下转的ncnn模型Windows/MacOS/Android/.. 也能直接用吗?
Yes全平台通用
- ## 如何去掉后处理,再导出 onnx
检测:
参考up的一篇文章<https://zhuanlan.zhihu.com/p/128974102>,步骤三就是去掉后处理,再导出onnx,其中去掉后处理可以是项目内测试时去掉后续步骤的结果。
- ## pytorch 有的层导不出 onnx 怎么办?
方式一:
ONNX_ATEN_FALLBACK
完全自定义的op先改成能导出的如 concat slice转到 ncnn 后再修改 param
方式二:
可以使用PNNX来试试参考以下文章大概说明:
1. [Windows/Linux/macOS 编译 PNNX 步骤](https://zhuanlan.zhihu.com/p/431833958)
2. [5分钟学会用 PNNX 转换 TorchScript 模型到 ncnn 模型](https://zhuanlan.zhihu.com/p/427512763)
# 使用
- ## vkEnumeratePhysicalDevices failed -3
- ## vkCreateInstance failed -9
出现此类问题请先更新GPU驱动。Please upgrade your GPU driver if you encounter this crash or error.
这里提供了一些品牌的GPU驱动下载网址.We have provided some drivers' download pages here.
[Intel](https://downloadcenter.intel.com/product/80939/Graphics-Drivers)[AMD](https://www.amd.com/en/support)[Nvidia](https://www.nvidia.com/Download/index.aspx)
- ## ModuleNotFoundError: No module named 'ncnn.ncnn'
python setup.py develop
- ## fopen nanodet-m.param failed
文件路径 working dir
File not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.
- ## find_blob_index_by_name data / output / ... failed
layer name vs blob name
param.bin 应该用 xxx.id.h 的枚举
- ## parse magic failed
- ## param is too old, please regenerate
模型本身有问题
Your model file is in the old format, converted by an old caffe2ncnn tool.
Checkout the latest ncnn code, build it and regenerate param and model binary files, and that should work.
Make sure that your param file starts with the magic number 7767517.
you may find more info on use-ncnn-with-alexnet
When adding the softmax layer yourself, you need to add 1=1
- ## set_vulkan_compute failed, network use_vulkan_compute disabled
你应该在 load_param / load_model 之前设置 net.opt.use_vulkan_compute = true;
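A minimal sketch of the correct order (the model file names are placeholders):
```cpp
ncnn::Net net;
net.opt.use_vulkan_compute = true; // must be set before load_param / load_model
net.load_param("model.param");
net.load_model("model.bin");
```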
- ## 多个blob输入多个blob输出怎么做
多次执行`ex.input()``ex.extract()`
```
ex.input("data1", in_1);
ex.input("data2", in_2);
ex.extract("output1", out_1);
ex.extract("output2", out_2);
```
- ## Extractor extract 多次会重复计算吗?
不会
- ## 如何看每一层的耗时?
cmake -DNCNN_BENCHMARK=ON ..
- ## 如何转换 cv::Mat CV_8UC3 BGR 图片
from_pixels to_pixels
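A minimal sketch, assuming `bgr` is a continuous `cv::Mat` of type CV_8UC3:
```cpp
// cv::Mat (CV_8UC3, BGR) -> ncnn::Mat
ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows);

// ncnn::Mat -> cv::Mat (CV_8UC3, BGR)
cv::Mat out(in.h, in.w, CV_8UC3);
in.to_pixels(out.data, ncnn::Mat::PIXEL_BGR);
```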
- ## 如何转换 float 数据为 ncnn::Mat
首先自己申请的内存需要自己管理此时ncnn::Mat不会自动给你释放你传过来的float数据
``` c++
std::vector<float> testData(60, 1.0); // 利用std::vector<float>自己管理内存的申请和释放
ncnn::Mat in1(60, (void*)testData.data()).reshape(4, 5, 3); // 把float数据的指针转成void*传过去即可,甚至还可以指定维度(up说最好使用reshape用来解决channel gap)
float* a = new float[60]; // 自己new一块内存后续需要自己释放
ncnn::Mat in2 = ncnn::Mat(60, (void*)a).reshape(4, 5, 3).clone(); // 使用方法和上面相同clone() to transfer data owner
```
- ## 如何初始化 ncnn::Mat 为全 0
`mat.fill(0.f);`
- ## 如何查看/获取版本号
cmake时会打印
c_api.h ncnn_version()
自己拼 1.0+yyyymmdd
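A minimal sketch using the C API (the printed string is illustrative):
```cpp
#include <cstdio>
#include "c_api.h"

int main()
{
    printf("ncnn version: %s\n", ncnn_version()); // e.g. 1.0.yyyymmdd
    return 0;
}
```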
- ## 如何转换 yuv 数据
yuv420sp2rgb yuv420sp2rgb_nv12
**[@metarutaiga](https://github.com/metarutaiga/xxYUV)**
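A minimal sketch for NV21 (yuv420sp) input, assuming `yuv420sp_data`, `w` and `h` come from the camera callback:
```cpp
std::vector<unsigned char> rgb(w * h * 3);
ncnn::yuv420sp2rgb(yuv420sp_data, w, h, rgb.data()); // NV21; use yuv420sp2rgb_nv12 for NV12
ncnn::Mat in = ncnn::Mat::from_pixels(rgb.data(), ncnn::Mat::PIXEL_RGB, w, h);
```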
- ## 如何 resize crop rotate 图片
[efficient roi resize rotate](https://github.com/Tencent/ncnn/wiki/efficient-roi-resize-rotate)
- ## 如何人脸5点对齐
get_affine_transform
warpaffine_bilinear_c3
```c
// 计算变换矩阵 并且求逆变换
int type = 0; // 0->区域外填充为v[0],v[1],v[2], -233->区域外不处理
unsigned int v = 0;
float tm[6];
float tm_inv[6];
// 人脸区域在原图上的坐标和宽高
float src_x = target->det.rect.x / target->det.w * pIveImageU8C3->u32Width;
float src_y = target->det.rect.y / target->det.h * pIveImageU8C3->u32Height;
float src_w = target->det.rect.w / target->det.w * pIveImageU8C3->u32Width;
float src_h = target->det.rect.h / target->det.h * pIveImageU8C3->u32Height;
float point_src[10] = {
src_x + src_w * target->attr.land[0][0], src_x + src_w * target->attr.land[0][1],
src_x + src_w * target->attr.land[1][0], src_x + src_w * target->attr.land[1][1],
src_x + src_w * target->attr.land[2][0], src_x + src_w * target->attr.land[2][1],
src_x + src_w * target->attr.land[3][0], src_x + src_w * target->attr.land[3][1],
src_x + src_w * target->attr.land[4][0], src_x + src_w * target->attr.land[4][1],
};
float point_dst[10] = { // +8 是因为我们处理112*112的图
30.2946f + 8.0f, 51.6963f,
65.5318f + 8.0f, 51.5014f,
48.0252f + 8.0f, 71.7366f,
33.5493f + 8.0f, 92.3655f,
62.7299f + 8.0f, 92.2041f,
};
// 第一种方式:先计算变换在求逆
AffineTrans::get_affine_transform(point_src, point_dst, 5, tm);
AffineTrans::invert_affine_transform(tm, tm_inv);
// 第二种方式:直接拿到求逆的结果
// AffineTrans::get_affine_transform(point_dst, point_src, 5, tm_inv);
// rgb 分离的,所以要单独处理
for(int c = 0; c < 3; c++)
{
unsigned char* pSrc = malloc(xxx);
unsigned char* pDst = malloc(xxx);
ncnn::warpaffine_bilinear_c1(pSrc, SrcWidth, SrcHeight, SrcStride[c], pDst, DstWidth, DstHeight, DstStride[c], tm_inv, type, v);
}
// rgb packed则可以一次处理
ncnn::warpaffine_bilinear_c3(pSrc, SrcWidth, SrcHeight, SrcStride, pDst, DstWidth, DstHeight, DstStride, tm_inv, type, v);
```
- ## 如何获得中间层的blob输出
ncnn::Mat output;
ex.extract("your_blob_name", output);
- ## 为什么我使用GPU但是GPU占用为0
windows 10 任务管理器 - 性能选项卡 - GPU - 选择其中一个视图左上角的下拉箭头切换到 Compute_0 / Compute_1 / Cuda
你还可以安装软件GPU-Z
- ## layer XYZ not exists or registered
Your network contains some operations that are not implemented in ncnn.
You may implement them as custom layers following how-to-implement-custom-layer-step-by-step.
Or you could simply register them as no-op if you are sure those operations make no sense.
```
class Noop : public ncnn::Layer {};
DEFINE_LAYER_CREATOR(Noop)
net.register_custom_layer("LinearRegressionOutput", Noop_layer_creator);
net.register_custom_layer("MAERegressionOutput", Noop_layer_creator);
```
- ## network graph not ready
You shall call Net::load_param() first, then Net::load_model().
This error may also happen when Net::load_param() fails but is not properly handled.
For more information about the ncnn model load api, see ncnn-load-model
- ## memory not 32-bit aligned at XYZ
The pointer passed to Net::load_param() or Net::load_model() is not 32bit aligned.
In practice, the head pointer of std::vector is not guaranteed to be 32bit aligned.
you can store your binary buffer in ncnn::Mat structure, its internal memory is aligned.
- ## crash on android with '__kmp_abort_process'
This usually happens if you bundle multiple shared libraries with openmp linked
It is actually an issue of the android ndk https://github.com/android/ndk/issues/1028
On old android ndk, modify the link flags as
-Wl,-Bstatic -lomp -Wl,-Bdynamic
For recent ndk >= 21
-fstatic-openmp
- ## dlopen failed: library "libomp.so" not found
Newer android ndk defaults to dynamic openmp runtime
modify the link flags as
-fstatic-openmp -fopenmp
- ## crash when freeing a ncnn dynamic library(.dll/.so) built with openMP
for optimal performance, the openmp threadpool spin waits for about a second prior to shutting down in case more work becomes available.
If you unload a dynamic library that's in the process of spin-waiting, it will crash in the manner you see (most of the time).
Just set OMP_WAIT_POLICY=passive in your environment before calling LoadLibrary, or just wait a few seconds before calling FreeLibrary.
You can also use the following method to set environment variables in your code:
for msvc++:
SetEnvironmentVariable(_T("OMP_WAIT_POLICY"), _T("passive"));
for g++:
setenv("OMP_WAIT_POLICY", "passive", 1)
reference: https://stackoverflow.com/questions/34439956/vc-crash-when-freeing-a-dll-built-with-openmp
# 跑出来的结果对不上
[ncnn-produce-wrong-result](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result)
- ## 如何打印 ncnn::Mat 的值?
```C++
void pretty_print(const ncnn::Mat& m)
{
for (int q=0; q<m.c; q++)
{
const float* ptr = m.channel(q);
for (int y=0; y<m.h; y++)
{
for (int x=0; x<m.w; x++)
{
printf("%f ", ptr[x]);
}
ptr += m.w;
printf("\n");
}
printf("------------------------\n");
}
}
```
In Android Studio, `printf` will not work, you can use `__android_log_print` instead. Example :
```C++
#include <android/log.h> // Don't forget this
void pretty_print(const ncnn::Mat& m)
{
for (int q=0; q<m.c; q++)
{
for (int y=0; y<m.h; y++)
{
for (int x=0; x<m.w; x++)
{
__android_log_print(ANDROID_LOG_DEBUG,"LOG_TAG","ncnn Mat is : %f", m.channel(q).row(y)[x]);
}
}
}
}
```
- ## 如何可视化 ncnn::Mat 的值?
```
void visualize(const char* title, const ncnn::Mat& m)
{
std::vector<cv::Mat> normed_feats(m.c);
for (int i=0; i<m.c; i++)
{
cv::Mat tmp(m.h, m.w, CV_32FC1, (void*)(const float*)m.channel(i));
cv::normalize(tmp, normed_feats[i], 0, 255, cv::NORM_MINMAX, CV_8U);
cv::cvtColor(normed_feats[i], normed_feats[i], cv::COLOR_GRAY2BGR);
// check NaN
for (int y=0; y<m.h; y++)
{
const float* tp = tmp.ptr<float>(y);
uchar* sp = normed_feats[i].ptr<uchar>(y);
for (int x=0; x<m.w; x++)
{
float v = tp[x];
if (v != v)
{
sp[0] = 0;
sp[1] = 0;
sp[2] = 255;
}
sp += 3;
}
}
}
int tw = m.w < 10 ? 32 : m.w < 20 ? 16 : m.w < 40 ? 8 : m.w < 80 ? 4 : m.w < 160 ? 2 : 1;
int th = (m.c - 1) / tw + 1;
cv::Mat show_map(m.h * th, m.w * tw, CV_8UC3);
show_map = cv::Scalar(127);
// tile
for (int i=0; i<m.c; i++)
{
int ty = i / tw;
int tx = i % tw;
normed_feats[i].copyTo(show_map(cv::Rect(tx * m.w, ty * m.h, m.w, m.h)));
}
cv::resize(show_map, show_map, cv::Size(0,0), 2, 2, cv::INTER_NEAREST);
cv::imshow(title, show_map);
}
```
- ## 总是输出第一张图的结果
复用 Extractor
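The usual cause is reusing one Extractor for every image; create a new one per image instead (blob names below are placeholders):
```cpp
for (size_t i = 0; i < inputs.size(); i++)
{
    ncnn::Extractor ex = net.create_extractor(); // cheap to create, do not reuse across images
    ex.input("data", inputs[i]);
    ncnn::Mat out;
    ex.extract("output", out);
}
```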
- ## 启用fp16时的精度有差异
net.opt.use_fp16_packed = false;
net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;
[ncnn-produce-wrong-result](https://github.com/Tencent/ncnn/wiki/FAQ-ncnn-produce-wrong-result)
# 如何跑得更快?内存占用更少?库体积更小?
- ## fp32 fp16
- ## 大小核绑定
ncnn::set_cpu_powersave(int)绑定大核或小核
注意windows系统不支持绑核。
ncnn支持不同的模型运行在不同的核心。假设硬件平台有2个大核4个小核你想把netA运行在大核netB运行在小核。
可以通过std::thread or pthread创建两个线程运行如下代码
0:全部
1:小核
2:大核
```
void thread_1()
{
ncnn::set_cpu_powersave(2); // bind to big cores
netA.opt.num_threads = 2;
}
void thread_2()
{
ncnn::set_cpu_powersave(1); // bind to little cores
netB.opt.num_threads = 4;
}
```
[openmp-best-practice.zh.md](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/openmp-best-practice.zh.md)
- ## 查看 CPU 或 GPU 数量
get_cpu_count
get_gpu_count
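A minimal sketch (the GPU count is only meaningful in a build with NCNN_VULKAN=ON):
```cpp
#include "cpu.h"
#include "gpu.h"

int cpu_count = ncnn::get_cpu_count();
int gpu_count = ncnn::get_gpu_count();
```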
- ## ncnnoptimize
使用方式一:
- ./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag
<br/>注意:这里的 flag 指的是 fp32 和 fp16,其中 0 指的是 fp32,1 指的是 fp16
使用方式二:
- ./ncnnoptimize ncnn.param ncnn.bin new.param new.bin flag cutstartname cutendname
<br/>cutstartname模型截取的起点
<br/>cutendname模型截取的终点
- ## 如何使用量化工具?
[Post Training Quantization Tools](https://github.com/Tencent/ncnn/tree/master/tools/quantize)
- ## 如何设置线程数?
opt.num_threads
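A minimal sketch; the per-extractor setting overrides the net-wide default for that extractor only:
```cpp
net.opt.num_threads = 4;                     // default for extractors created from this net

ncnn::Extractor ex = net.create_extractor();
ex.set_num_threads(2);                       // override for this extractor only
```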
- ## 如何降低CPU占用率
net.opt.openmp_blocktime = 0;
OMP_WAIT_POLICY=passive
- ## 如何 batch inference
```
int max_batch_size = vkdev->info.compute_queue_count;
ncnn::Mat inputs[1000];
ncnn::Mat outputs[1000];
#pragma omp parallel for num_threads(max_batch_size)
for (int i=0; i<1000; i++)
{
ncnn::Extractor ex = net1.create_extractor();
ex.input("data", inputs[i]);
ex.extract("prob", outputs[i]);
}
```
- ## partial graph inference
先 extract 分类,判断后,再 extract bbox
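A minimal sketch; intermediate results are cached inside the Extractor, so the second extract only runs the remaining layers (blob names are placeholders):
```cpp
ncnn::Extractor ex = net.create_extractor();
ex.input("data", in);

ncnn::Mat cls;
ex.extract("cls_score", cls);

if (cls[0] > 0.5f)
{
    ncnn::Mat bbox;
    ex.extract("bbox_pred", bbox); // reuses the cached classification results
}
```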
- ## 如何启用 bf16s 加速?
```
net.opt.use_packing_layout = true;
net.opt.use_bf16_storage = true;
```
[用bf16加速ncnn](https://zhuanlan.zhihu.com/p/112564372) **@[nihui](https://www.zhihu.com/people/nihui-2)**
A53
- ## 如何裁剪更小的 ncnn 库?
[build-minimal-library](https://github.com/Tencent/ncnn/wiki/build-minimal-library)
- ## net.opt sgemm winograd fp16_storage 各是有什么作用?
对内存消耗的影响
# 白嫖项目
- ## nanodet
# 其他
- ## up主用的什么系统/编辑器/开发环境?
| 软件类型 | 软件名称 |
| ------------| ----------- |
| 系统 | Fedora |
| 桌面环境 | KDE |
| 编辑器 | Kate |
| 画草图 | kolourpaint |
| 画函数图像 | kmplot |
| bilibili直播 | OBS |


@ -0,0 +1,139 @@
# 用 Visual Studio 编译
[TOC]
## 预先准备
Visual Studio 2015 / 2017 / 2019 / 2022 Preview 的 Community Edition 版本, 使用动态的 CRT 运行库
CMake, 推荐 >= 3.17 的版本
## 开始编译
### 最简编译
https://github.com/Tencent/ncnn.git
#### 命令提示符版本
```batch
mkdir build-vs2019
cd build-vs2019
cmake -G "Visual Studio 16 2019" -A x64 ..
cmake --build . --config Release
cmake --install . --config Release
cmake --build . --config Debug
cmake --install . --config Debug
```
会安装在 build-vs2019/install 里头debug 版本的库会带有 `d` 后缀。
#### x64 本机工具命令提示符 版本 VS2022无X64
ncnn
protobuf参照后文定义参数
```batch
mkdir build-vs2019
cd build-vs2019
cmake ..
cmake --build .
cmake --install . --config Debug
REM 默认 build 生成 Debug 版本,默认 install 安装 Release 版本。参照命令提示符版本
```
### 编译安装带 Vulkan 支持的 ncnn 库
#### 设备和 Vulkan 准备
确认设备支持 Vulkan 安装显卡驱动。
下载和安装 Vulkan SDK: https://vulkan.lunarg.com/sdk/home
连同子模块一起,获取源码:
- 可从 http://github.com/Tencent/ncnn/releases 找到 "ncnn-YYYYMMDD-full-source.zip" 下载
- 或用 git 获取最新版本:
```batch
git clone https://github.com/tencent/ncnn
git submodule update --init
```
#### 编译安装 ncnn
```batch
mkdir build-vs2019
cd build-vs2019
cmake -G "Visual Studio 16 2019" -A x64 -DCMAKE_INSTALL_PREFIX="%cd%/install" -DNCNN_VULKAN=ON
cmake --build . --config Release
cmake --install . --config Release
cmake --build . --config Debug
cmake --install . --config Debug
```
### 编译安装 ncnn 库和模型转换工具
- 此步骤用于编译模型转换工具,可跳过,直接使用 https://convertmodel.com 工具转换
- 以下命令行均使用 **适用于 VS 2019 的 x64 本机工具命令提示**
*注:若在 cmd / PowerShell 进行,需修改:*
- `-G"NMake Makefile"` 改为合适的 Generator 如 `-G "Visual Studio 16 2019" -A x64`
- `nmake` 改为 `cmake --build . --config Release` 或打开 `.sln` 手动触发 `protobuf` / `ncnn` 项的构建
- `nmake install` 改为 `cmake --install . --config Release`,或打开 `.sln` 手动触发 `INSTALL` 项的构建
#### 编译安装 protobuf
用于生成 caffe2ncnn 和 onnx2ncnn 工具
https://github.com/google/protobuf/archive/v3.4.0.zip
我下载到 C:/Users/shuiz/source 解压缩
```batch
mkdir build-vs2019
cd build-vs2019
cmake -G"NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="%cd%/install" ^
-Dprotobuf_BUILD_TESTS=OFF ^
-Dprotobuf_MSVC_STATIC_RUNTIME=OFF ../cmake
nmake
nmake install
```
protobuf 会安装在 build-vs2019/install 里头
#### 编译安装 ncnn
https://github.com/Tencent/ncnn.git
cmake 命令中的 protobuf 路径要相应修改成自己的
```batch
mkdir build-vs2019
cd build-vs2019
cmake -G"NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX="%cd%/install" ^
-DProtobuf_INCLUDE_DIR=C:/Users/shuiz/source/protobuf-3.4.0/build-vs2019/install/include ^
-DProtobuf_LIBRARIES=C:/Users/shuiz/source/protobuf-3.4.0/build-vs2019/install/lib/libprotobuf.lib ^
-DProtobuf_PROTOC_EXECUTABLE=C:/Users/shuiz/source/protobuf-3.4.0/build-vs2019/install/bin/protoc.exe ..
nmake
nmake install
```
ncnn 会安装在 build-vs2019/install 里头
ncnn 转换工具在 build-vs2019/tools 里头
#### mlir2ncnn
见 [build-mlir2ncnn](build-mlir2ncnn.md)
## 使用编译好的 ncnn 库
CMakeLists 里写
```cmake
set(ncnn_DIR "C:/Users/shuiz/source/ncnn/build-vs2019/install/lib/cmake/ncnn" CACHE PATH "包含 ncnnConfig.cmake 的目录")
find_package(ncnn REQUIRED)
target_link_libraries(my_target ncnn)
```
进一步了解 [use-ncnn-with-own-project](../how-to-use-and-FAQ/use-ncnn-with-own-project.md)


@ -0,0 +1,54 @@
# mlir2ncnn
## Compile
**Clone LLVM**
```bash
git clone https://github.com/llvm/llvm-project.git
git checkout -b mlir <a_working_commit_id>
```
Current working commit id is 74e6030bcbcc8e628f9a99a424342a0c656456f9:
```
$ git log
commit 74e6030bcbcc8e628f9a99a424342a0c656456f9 (HEAD -> main, origin/main, origin/HEAD)
Author: Craig Topper <craig.topper@sifive.com>
Date: Thu Mar 4 22:30:38 2021 -0800
[TargetLowering] Use HandleSDNodes to prevent nodes from being deleted by recursive calls in getNegatedExpression.
```
It is determined by querying the latest git commit date of the `tools/mlir` directory.
**Compile mlir**
```bash
cd llvm-project
mkdir build
cd build
cmake -G Ninja -DCMAKE_INSTALL_PREFIX=install -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON -DLLVM_ENABLE_PROJECTS="mlir" -DLLVM_TARGETS_TO_BUILD="" -DLLVM_INCLUDE_EXAMPLES=OFF -DLLVM_INCLUDE_TESTS=OFF ../llvm/
ninja -j8
ninja install
```
**Compile mlir2ncnn**
```bash
cd tools/mlir
mkdir build
cd build
cmake .. -D LLVM_DIR=<path/to/your/llvm_install/lib/cmake/llvm>
make
```
## Usage
**Export `.mlir`**
See https://zhuanlan.zhihu.com/p/152535430
**Usage mlir2ncnn**
```
./mlir2ncnn pix2pix.mlir pix2pix.param pix2pix.bin
```


@ -0,0 +1,668 @@
### Git clone ncnn repo with submodule
```
$ git clone https://github.com/Tencent/ncnn.git
$ cd ncnn
$ git submodule update --init
```
* [Build for Linux / NVIDIA Jetson / Raspberry Pi](#build-for-linux)
* [Build for Windows x64 using VS2017](#build-for-windows-x64-using-visual-studio-community-2017)
* [Build for macOS](#build-for-macos)
* [Build for ARM Cortex-A family with cross-compiling](#build-for-arm-cortex-a-family-with-cross-compiling)
* [Build for Hisilicon platform with cross-compiling](#build-for-hisilicon-platform-with-cross-compiling)
* [Build for Android](#build-for-android)
* [Build for iOS on macOS with xcode](#build-for-ios-on-macos-with-xcode)
* [Build for WebAssembly](#build-for-webassembly)
* [Build for AllWinner D1](#build-for-allwinner-d1)
* [Build for Loongson 2K1000](#build-for-loongson-2k1000)
* [Build for Termux on Android](#Build-for-Termux-on-Android)
***
### Build for Linux
Install required build dependencies:
* git
* g++
* cmake
* protocol buffer (protobuf) header files and protobuf compiler
* vulkan header files and loader library
* glslang
* (optional) opencv # For building examples
Generally, if you have an Intel, AMD or Nvidia GPU from the last 10 years, Vulkan can easily be used.
On some systems there are no Vulkan drivers easily available at the moment (October 2020), so you might need to disable use of Vulkan on them. This applies to Raspberry Pi 3 (but there is experimental open source Vulkan driver in the works, which is not ready yet). Nvidia Tegra series devices (like Nvidia Jetson) should support Vulkan. Ensure you have most recent software installed for best experience.
On Debian, Ubuntu or Raspberry Pi OS, you can install all required dependencies using:
```shell
sudo apt install build-essential git cmake libprotobuf-dev protobuf-compiler libvulkan-dev vulkan-utils libopencv-dev
```
To use Vulkan backend install Vulkan header files, a vulkan driver loader, GLSL to SPIR-V compiler and vulkaninfo tool. Preferably from your distribution repositories. Alternatively download and install full Vulkan SDK (about 200MB in size; it contains all header files, documentation and prebuilt loader, as well some extra tools and source code of everything) from https://vulkan.lunarg.com/sdk/home
```shell
wget https://sdk.lunarg.com/sdk/download/1.2.189.0/linux/vulkansdk-linux-x86_64-1.2.189.0.tar.gz?Human=true -O vulkansdk-linux-x86_64-1.2.189.0.tar.gz
tar -xf vulkansdk-linux-x86_64-1.2.189.0.tar.gz
export VULKAN_SDK=$(pwd)/1.2.189.0/x86_64
```
To use Vulkan after building ncnn later, you will also need to have Vulkan driver for your GPU. For AMD and Intel GPUs these can be found in Mesa graphics driver, which usually is installed by default on all distros (i.e. `sudo apt install mesa-vulkan-drivers` on Debian/Ubuntu). For Nvidia GPUs the proprietary Nvidia driver must be downloaded and installed (some distros will allow easier installation in some way). After installing Vulkan driver, confirm Vulkan libraries and driver are working, by using `vulkaninfo` or `vulkaninfo | grep deviceType`, it should list GPU device type. If there are more than one GPU installed (including the case of integrated GPU and discrete GPU, commonly found in laptops), you might need to note the order of devices to use later on.
#### Nvidia Jetson
The Vulkan driver is a default component of the Linux For Tegra BSP release, check [the device list](https://developer.nvidia.com/embedded/vulkan).
```shell
cd ncnn
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=../toolchains/jetson.toolchain.cmake -DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..
make -j$(nproc)
```
#### Raspberry Pi
Vulkan drivers do exist, but are not mature. You are free to experiment at your own discretion, and report results and performance.
```shell
cd ncnn
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DNCNN_VULKAN=ON -DNCNN_SYSTEM_GLSLANG=ON -DNCNN_BUILD_EXAMPLES=ON ..
make -j$(nproc)
```
You can add `-GNinja` to `cmake` above to use Ninja build system (invoke build using `ninja` or `cmake --build .`).
For Raspberry Pi 3, add `-DCMAKE_TOOLCHAIN_FILE=../toolchains/pi3.toolchain.cmake -DPI3=ON` to cmake. You can also consider disabling Vulkan support, as the Vulkan drivers for Raspberry Pi are still not mature, but it doesn't hurt to build the support in and simply not use it.
#### Verification
Verify build by running some examples:
```shell
cd ../examples
../build/examples/squeezenet ../images/256-ncnn.png
[0 AMD RADV FIJI (LLVM 10.0.1)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD RADV FIJI (LLVM 10.0.1)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 AMD RADV FIJI (LLVM 10.0.1)] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
532 = 0.163452
920 = 0.093140
716 = 0.061584
```
You can also run benchmarks (the 4th argument is a GPU device index to use, refer to `vulkaninfo`, if you have more than one GPU):
```shell
cd ../benchmark
../build/benchmark/benchncnn 10 $(nproc) 0 0
[0 AMD RADV FIJI (LLVM 10.0.1)] queueC=1[4] queueG=0[1] queueT=0[1]
[0 AMD RADV FIJI (LLVM 10.0.1)] bugsbn1=0 buglbia=0 bugcopc=0 bugihfa=0
[0 AMD RADV FIJI (LLVM 10.0.1)] fp16p=1 fp16s=1 fp16a=0 int8s=1 int8a=1
num_threads = 4
powersave = 0
gpu_device = 0
cooling_down = 1
squeezenet min = 4.68 max = 4.99 avg = 4.85
squeezenet_int8 min = 38.52 max = 66.90 avg = 48.52
...
```
To run benchmarks on a CPU, set the 5th argument to `-1`.
***
### Build for Windows x64 using Visual Studio Community 2017
Download and Install Visual Studio Community 2017 from https://visualstudio.microsoft.com/vs/community/
Start the command prompt: `Start → Programs → Visual Studio 2017 → Visual Studio Tools → x64 Native Tools Command Prompt for VS 2017`
Download protobuf-3.4.0 from https://github.com/google/protobuf/archive/v3.4.0.zip
Build protobuf library:
```shell
cd <protobuf-root-dir>
mkdir build
cd build
cmake -G"NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=%cd%/install -Dprotobuf_BUILD_TESTS=OFF -Dprotobuf_MSVC_STATIC_RUNTIME=OFF ../cmake
nmake
nmake install
```
(optional) Download and install Vulkan SDK from https://vulkan.lunarg.com/sdk/home
Build ncnn library (replace <protobuf-root-dir> with a proper path):
```shell
cd <ncnn-root-dir>
mkdir -p build
cd build
cmake -G"NMake Makefiles" -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=%cd%/install -DProtobuf_INCLUDE_DIR=<protobuf-root-dir>/build/install/include -DProtobuf_LIBRARIES=<protobuf-root-dir>/build/install/lib/libprotobuf.lib -DProtobuf_PROTOC_EXECUTABLE=<protobuf-root-dir>/build/install/bin/protoc.exe -DNCNN_VULKAN=ON ..
nmake
nmake install
```
Note: To speed up compilation process on multi core machines, configuring `cmake` to use `jom` or `ninja` using `-G` flag is recommended.
***
### Build for macOS
We've published ncnn to [brew](https://formulae.brew.sh/formula/ncnn#default) now, you can just use following method to install ncnn if you have the Xcode Command Line Tools installed.
```shell
brew update
brew install ncnn
```
Or if you want to compile and build ncnn locally, first install Xcode or Xcode Command Line Tools according to your needs.
Then install `protobuf` and `libomp` via homebrew
```shell
brew install protobuf libomp
```
Download and install Vulkan SDK from <https://vulkan.lunarg.com/sdk/home>
```shell
wget https://sdk.lunarg.com/sdk/download/1.2.189.0/mac/vulkansdk-macos-1.2.189.0.dmg?Human=true -O vulkansdk-macos-1.2.189.0.dmg
hdiutil attach vulkansdk-macos-1.2.189.0.dmg
sudo /Volumes/vulkansdk-macos-1.2.189.0/InstallVulkan.app/Contents/MacOS/InstallVulkan --root `pwd`/vulkansdk-macos-1.2.189.0 --accept-licenses --default-answer --confirm-command install
hdiutil detach /Volumes/vulkansdk-macos-1.2.189.0
# setup env
export VULKAN_SDK=`pwd`/vulkansdk-macos-1.2.189.0/macOS
```
```shell
cd <ncnn-root-dir>
mkdir -p build
cd build
cmake -DCMAKE_OSX_ARCHITECTURES="x86_64;arm64" \
-DVulkan_INCLUDE_DIR=`pwd`/../vulkansdk-macos-1.2.189.0/MoltenVK/include \
-DVulkan_LIBRARY=`pwd`/../vulkansdk-macos-1.2.189.0/MoltenVK/dylib/macOS/libMoltenVK.dylib \
-DNCNN_VULKAN=ON -DNCNN_BUILD_EXAMPLES=ON ..
cmake --build . -j 4
cmake --build . --target install
```
*Note: If you encounter `libomp` related errors during installation, you can also check our GitHub Actions at [here](https://github.com/Tencent/ncnn/blob/d91cccf/.github/workflows/macos-x64-gpu.yml#L50-L68) to install and use `openmp`.*
***
### Build for ARM Cortex-A family with cross-compiling
Download ARM toolchain from https://developer.arm.com/open-source/gnu-toolchain/gnu-a/downloads
```shell
export PATH="<your-toolchain-compiler-path>:${PATH}"
```
Alternatively install a cross-compiler provided by the distribution (i.e. on Debian / Ubuntu, you can do `sudo apt install g++-arm-linux-gnueabi g++-arm-linux-gnueabihf g++-aarch64-linux-gnu`).
Depending on your needs build one or more of the below targets.
AArch32 target with soft float (arm-linux-gnueabi)
```shell
cd <ncnn-root-dir>
mkdir -p build-arm-linux-gnueabi
cd build-arm-linux-gnueabi
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabi.toolchain.cmake ..
make -j$(nproc)
```
AArch32 target with hard float (arm-linux-gnueabihf)
```shell
cd <ncnn-root-dir>
mkdir -p build-arm-linux-gnueabihf
cd build-arm-linux-gnueabihf
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/arm-linux-gnueabihf.toolchain.cmake ..
make -j$(nproc)
```
AArch64 GNU/Linux target (aarch64-linux-gnu)
```shell
cd <ncnn-root-dir>
mkdir -p build-aarch64-linux-gnu
cd build-aarch64-linux-gnu
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/aarch64-linux-gnu.toolchain.cmake ..
make -j$(nproc)
```
***
### Build for Hisilicon platform with cross-compiling
Download and install Hisilicon SDK. The toolchain should be in `/opt/hisi-linux/x86-arm`
```shell
cd <ncnn-root-dir>
mkdir -p build
cd build
# Choose one cmake toolchain file depends on your target platform
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv300.toolchain.cmake ..
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/hisiv500.toolchain.cmake ..
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix100.toolchain.cmake ..
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/himix200.toolchain.cmake ..
make -j$(nproc)
make install
```
***
### Build for Android
You can use the pre-build ncnn-android-lib.zip from https://github.com/Tencent/ncnn/releases
Download Android NDK from http://developer.android.com/ndk/downloads/index.html and install it, for example:
```shell
unzip android-ndk-r21d-linux-x86_64.zip
export ANDROID_NDK=<your-ndk-root-path>
```
(optional) remove the hardcoded debug flag in Android NDK [android-ndk issue](https://github.com/android-ndk/ndk/issues/243)
```
# open $ANDROID_NDK/build/cmake/android.toolchain.cmake
# delete "-g" line
list(APPEND ANDROID_COMPILER_FLAGS
-g
-DANDROID
```
Build armv7 library
```shell
cd <ncnn-root-dir>
mkdir -p build-android-armv7
cd build-android-armv7
cmake -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI="armeabi-v7a" -DANDROID_ARM_NEON=ON \
-DANDROID_PLATFORM=android-14 ..
# If you want to enable Vulkan, platform api version >= android-24 is needed
cmake -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI="armeabi-v7a" -DANDROID_ARM_NEON=ON \
-DANDROID_PLATFORM=android-24 -DNCNN_VULKAN=ON ..
make -j$(nproc)
make install
```
Pick `build-android-armv7/install` folder for further JNI usage.
Build aarch64 library:
```shell
cd <ncnn-root-dir>
mkdir -p build-android-aarch64
cd build-android-aarch64
cmake -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake"\
-DANDROID_ABI="arm64-v8a" \
-DANDROID_PLATFORM=android-21 ..
# If you want to enable Vulkan, platform api version >= android-24 is needed
cmake -DCMAKE_TOOLCHAIN_FILE="$ANDROID_NDK/build/cmake/android.toolchain.cmake" \
-DANDROID_ABI="arm64-v8a" \
-DANDROID_PLATFORM=android-24 -DNCNN_VULKAN=ON ..
make -j$(nproc)
make install
```
Pick `build-android-aarch64/install` folder for further JNI usage.
***
### Build for iOS on macOS with xcode
You can use the pre-build ncnn.framework glslang.framework and openmp.framework from https://github.com/Tencent/ncnn/releases
Install xcode
You can replace ```-DENABLE_BITCODE=0``` to ```-DENABLE_BITCODE=1``` in the following cmake arguments if you want to build bitcode enabled libraries.
Download and install openmp for multithreading inference feature on iPhoneOS
```shell
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/openmp-11.0.0.src.tar.xz
tar -xf openmp-11.0.0.src.tar.xz
cd openmp-11.0.0.src
# apply some compilation fix
sed -i'' -e '/.size __kmp_unnamed_critical_addr/d' runtime/src/z_Linux_asm.S
sed -i'' -e 's/__kmp_unnamed_critical_addr/___kmp_unnamed_critical_addr/g' runtime/src/z_Linux_asm.S
mkdir -p build-ios
cd build-ios
cmake -DCMAKE_TOOLCHAIN_FILE=<ncnn-root-dir>/toolchains/ios.toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install \
-DIOS_PLATFORM=OS -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 -DIOS_ARCH="armv7;arm64;arm64e" \
-DPERL_EXECUTABLE=/usr/local/bin/perl \
-DLIBOMP_ENABLE_SHARED=OFF -DLIBOMP_OMPT_SUPPORT=OFF -DLIBOMP_USE_HWLOC=OFF ..
cmake --build . -j 4
cmake --build . --target install
# copy openmp library and header files to xcode toolchain sysroot
# <xcode-dir> is usually /Applications/Xcode.app or /Applications/Xcode-beta.app depends on your Xcode version
sudo cp install/include/* <xcode-dir>/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/include
sudo cp install/lib/libomp.a <xcode-dir>/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib
```
Download and install openmp for multithreading inference feature on iPhoneSimulator
```shell
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-11.0.0/openmp-11.0.0.src.tar.xz
tar -xf openmp-11.0.0.src.tar.xz
cd openmp-11.0.0.src
# apply some compilation fix
sed -i'' -e '/.size __kmp_unnamed_critical_addr/d' runtime/src/z_Linux_asm.S
sed -i'' -e 's/__kmp_unnamed_critical_addr/___kmp_unnamed_critical_addr/g' runtime/src/z_Linux_asm.S
mkdir -p build-ios-sim
cd build-ios-sim
cmake -DCMAKE_TOOLCHAIN_FILE=<ncnn-root-dir>/toolchains/ios.toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install \
-DIOS_PLATFORM=SIMULATOR -DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 -DIOS_ARCH="i386;x86_64" \
-DPERL_EXECUTABLE=/usr/local/bin/perl \
-DLIBOMP_ENABLE_SHARED=OFF -DLIBOMP_OMPT_SUPPORT=OFF -DLIBOMP_USE_HWLOC=OFF ..
cmake --build . -j 4
cmake --build . --target install
# copy openmp library and header files to xcode toolchain sysroot
# <xcode-dir> is usually /Applications/Xcode.app or /Applications/Xcode-beta.app depends on your Xcode version
sudo cp install/include/* <xcode-dir>/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/include
sudo cp install/lib/libomp.a <xcode-dir>/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib
```
Package openmp framework:
```shell
cd <openmp-root-dir>
mkdir -p openmp.framework/Versions/A/Headers
mkdir -p openmp.framework/Versions/A/Resources
ln -s A openmp.framework/Versions/Current
ln -s Versions/Current/Headers openmp.framework/Headers
ln -s Versions/Current/Resources openmp.framework/Resources
ln -s Versions/Current/openmp openmp.framework/openmp
lipo -create build-ios/install/lib/libomp.a build-ios-sim/install/lib/libomp.a -o openmp.framework/Versions/A/openmp
cp -r build-ios/install/include/* openmp.framework/Versions/A/Headers/
sed -e 's/__NAME__/openmp/g' -e 's/__IDENTIFIER__/org.llvm.openmp/g' -e 's/__VERSION__/11.0/g' <ncnn-root-dir>/Info.plist > openmp.framework/Versions/A/Resources/Info.plist
```
Download and install Vulkan SDK from https://vulkan.lunarg.com/sdk/home
```shell
wget https://sdk.lunarg.com/sdk/download/1.2.189.0/mac/vulkansdk-macos-1.2.189.0.dmg?Human=true -O vulkansdk-macos-1.2.189.0.dmg
hdiutil attach vulkansdk-macos-1.2.189.0.dmg
sudo /Volumes/vulkansdk-macos-1.2.189.0/InstallVulkan.app/Contents/MacOS/InstallVulkan --root `pwd`/vulkansdk-macos-1.2.189.0 --accept-licenses --default-answer --confirm-command install
hdiutil detach /Volumes/vulkansdk-macos-1.2.189.0
# setup env
export VULKAN_SDK=`pwd`/vulkansdk-macos-1.2.189.0/macOS
```
Build library for iPhoneOS:
```shell
cd <ncnn-root-dir>
git submodule update --init
mkdir -p build-ios
cd build-ios
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DIOS_PLATFORM=OS -DIOS_ARCH="armv7;arm64;arm64e" \
-DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 \
-DOpenMP_C_FLAGS="-Xclang -fopenmp" -DOpenMP_CXX_FLAGS="-Xclang -fopenmp" \
-DOpenMP_C_LIB_NAMES="libomp" -DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib/libomp.a" \
-DNCNN_BUILD_BENCHMARK=OFF ..
# vulkan is only available on arm64 devices
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DIOS_PLATFORM=OS64 -DIOS_ARCH="arm64;arm64e" \
-DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 \
-DOpenMP_C_FLAGS="-Xclang -fopenmp" -DOpenMP_CXX_FLAGS="-Xclang -fopenmp" \
-DOpenMP_C_LIB_NAMES="libomp" -DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk/usr/lib/libomp.a" \
-DVulkan_INCLUDE_DIR=$VULKAN_SDK/../MoltenVK/include \
-DVulkan_LIBRARY=$VULKAN_SDK/../MoltenVK/dylib/iOS/libMoltenVK.dylib \
-DNCNN_VULKAN=ON -DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Build library for iPhoneSimulator:
```shell
cd <ncnn-root-dir>
mkdir -p build-ios-sim
cd build-ios-sim
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/ios.toolchain.cmake -DIOS_PLATFORM=SIMULATOR -DIOS_ARCH="i386;x86_64" \
-DENABLE_BITCODE=0 -DENABLE_ARC=0 -DENABLE_VISIBILITY=0 \
-DOpenMP_C_FLAGS="-Xclang -fopenmp" -DOpenMP_CXX_FLAGS="-Xclang -fopenmp" \
-DOpenMP_C_LIB_NAMES="libomp" -DOpenMP_CXX_LIB_NAMES="libomp" \
-DOpenMP_libomp_LIBRARY="/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneSimulator.platform/Developer/SDKs/iPhoneSimulator.sdk/usr/lib/libomp.a" \
-DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Package glslang framework:
```shell
cd <ncnn-root-dir>
mkdir -p glslang.framework/Versions/A/Headers
mkdir -p glslang.framework/Versions/A/Resources
ln -s A glslang.framework/Versions/Current
ln -s Versions/Current/Headers glslang.framework/Headers
ln -s Versions/Current/Resources glslang.framework/Resources
ln -s Versions/Current/glslang glslang.framework/glslang
libtool -static build-ios/install/lib/libglslang.a build-ios/install/lib/libSPIRV.a build-ios/install/lib/libOGLCompiler.a build-ios/install/lib/libOSDependent.a -o build-ios/install/lib/libglslang_combined.a
libtool -static build-ios-sim/install/lib/libglslang.a build-ios-sim/install/lib/libSPIRV.a build-ios-sim/install/lib/libOGLCompiler.a build-ios-sim/install/lib/libOSDependent.a -o build-ios-sim/install/lib/libglslang_combined.a
lipo -create build-ios/install/lib/libglslang_combined.a build-ios-sim/install/lib/libglslang_combined.a -o glslang.framework/Versions/A/glslang
cp -r build/install/include/glslang glslang.framework/Versions/A/Headers/
sed -e 's/__NAME__/glslang/g' -e 's/__IDENTIFIER__/org.khronos.glslang/g' -e 's/__VERSION__/1.0/g' Info.plist > glslang.framework/Versions/A/Resources/Info.plist
```
Package ncnn framework:
```shell
cd <ncnn-root-dir>
mkdir -p ncnn.framework/Versions/A/Headers
mkdir -p ncnn.framework/Versions/A/Resources
ln -s A ncnn.framework/Versions/Current
ln -s Versions/Current/Headers ncnn.framework/Headers
ln -s Versions/Current/Resources ncnn.framework/Resources
ln -s Versions/Current/ncnn ncnn.framework/ncnn
lipo -create build-ios/install/lib/libncnn.a build-ios-sim/install/lib/libncnn.a -o ncnn.framework/Versions/A/ncnn
cp -r build-ios/install/include/* ncnn.framework/Versions/A/Headers/
sed -e 's/__NAME__/ncnn/g' -e 's/__IDENTIFIER__/com.tencent.ncnn/g' -e 's/__VERSION__/1.0/g' Info.plist > ncnn.framework/Versions/A/Resources/Info.plist
```
Pick `ncnn.framework` `glslang.framework` and `openmp.framework` folder for app development.
***
### Build for WebAssembly
Install Emscripten
```shell
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install 2.0.8
./emsdk activate 2.0.8
source emsdk/emsdk_env.sh
```
Build without any extension for general compatibility:
```shell
mkdir -p build
cd build
cmake -DCMAKE_TOOLCHAIN_FILE=../emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \
-DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \
-DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Build with WASM SIMD extension:
```shell
mkdir -p build-simd
cd build-simd
cmake -DCMAKE_TOOLCHAIN_FILE=../emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \
-DNCNN_THREADS=OFF -DNCNN_OPENMP=OFF -DNCNN_SIMPLEOMP=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \
-DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Build with WASM Thread extension:
```shell
mkdir -p build-threads
cd build-threads
cmake -DCMAKE_TOOLCHAIN_FILE=../emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \
-DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=OFF -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \
-DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Build with WASM SIMD and Thread extension:
```shell
mkdir -p build-simd-threads
cd build-simd-threads
cmake -DCMAKE_TOOLCHAIN_FILE=../emsdk/upstream/emscripten/cmake/Modules/Platform/Emscripten.cmake \
-DNCNN_THREADS=ON -DNCNN_OPENMP=ON -DNCNN_SIMPLEOMP=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_SSE2=ON -DNCNN_AVX2=OFF -DNCNN_AVX=OFF \
-DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_BENCHMARK=OFF ..
cmake --build . -j 4
cmake --build . --target install
```
Pick `build-XYZ/install` folder for further usage.
***
### Build for AllWinner D1
Download c906 toolchain package from https://occ.t-head.cn/community/download?id=3913221581316624384
```shell
tar -xf riscv64-linux-x86_64-20210512.tar.gz
export RISCV_ROOT_PATH=/home/nihui/osd/riscv64-linux-x86_64-20210512
```
Build ncnn with riscv-v vector and simpleocv enabled:
```shell
mkdir -p build-c906
cd build-c906
cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/c906.toolchain.cmake \
-DCMAKE_BUILD_TYPE=relwithdebinfo -DNCNN_OPENMP=OFF -DNCNN_THREADS=OFF -DNCNN_RUNTIME_CPU=OFF -DNCNN_RVV=ON \
-DNCNN_SIMPLEOCV=ON -DNCNN_BUILD_EXAMPLES=ON ..
cmake --build . -j 4
cmake --build . --target install
```
Pick `build-c906/install` folder for further usage.
You can upload binary inside `build-c906/examples` folder and run on D1 board for testing.
***
### Build for Loongson 2K1000
For gcc versions < 8.5, you need to fix the msa.h header to work around the msa fmadd/fmsub/maddv/msubv bug.
Open ```/usr/lib/gcc/mips64el-linux-gnuabi64/8/include/msa.h```, find ```__msa_fmadd``` and ```__msa_fmsub``` and apply changes as the following
```c
// #define __msa_fmadd_w __builtin_msa_fmadd_w
// #define __msa_fmadd_d __builtin_msa_fmadd_d
// #define __msa_fmsub_w __builtin_msa_fmsub_w
// #define __msa_fmsub_d __builtin_msa_fmsub_d
#define __msa_fmadd_w(a, b, c) __builtin_msa_fmadd_w(c, b, a)
#define __msa_fmadd_d(a, b, c) __builtin_msa_fmadd_d(c, b, a)
#define __msa_fmsub_w(a, b, c) __builtin_msa_fmsub_w(c, b, a)
#define __msa_fmsub_d(a, b, c) __builtin_msa_fmsub_d(c, b, a)
```
find ```__msa_maddv``` and ```__msa_msubv``` and apply changes as the following
```c
// #define __msa_maddv_b __builtin_msa_maddv_b
// #define __msa_maddv_h __builtin_msa_maddv_h
// #define __msa_maddv_w __builtin_msa_maddv_w
// #define __msa_maddv_d __builtin_msa_maddv_d
// #define __msa_msubv_b __builtin_msa_msubv_b
// #define __msa_msubv_h __builtin_msa_msubv_h
// #define __msa_msubv_w __builtin_msa_msubv_w
// #define __msa_msubv_d __builtin_msa_msubv_d
#define __msa_maddv_b(a, b, c) __builtin_msa_maddv_b(c, b, a)
#define __msa_maddv_h(a, b, c) __builtin_msa_maddv_h(c, b, a)
#define __msa_maddv_w(a, b, c) __builtin_msa_maddv_w(c, b, a)
#define __msa_maddv_d(a, b, c) __builtin_msa_maddv_d(c, b, a)
#define __msa_msubv_b(a, b, c) __builtin_msa_msubv_b(c, b, a)
#define __msa_msubv_h(a, b, c) __builtin_msa_msubv_h(c, b, a)
#define __msa_msubv_w(a, b, c) __builtin_msa_msubv_w(c, b, a)
#define __msa_msubv_d(a, b, c) __builtin_msa_msubv_d(c, b, a)
```
Build ncnn with mips msa and simpleocv enabled:
```shell
mkdir -p build
cd build
cmake -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON -DNCNN_RUNTIME_CPU=OFF -DNCNN_MSA=ON -DNCNN_MMI=ON -DNCNN_SIMPLEOCV=ON ..
cmake --build . -j 2
cmake --build . --target install
```
Pick `build/install` folder for further usage.
You can run binary inside `build/examples` folder for testing.
***
### Build for Termux on Android
Install the Termux app on your phone, and install Ubuntu in Termux.
If you want to use ssh, just install openssh in Termux
```
pkg install proot-distro
proot-distro install ubuntu
```
Alternatively, you can see which systems can be installed using `proot-distro list`.
Once Ubuntu is installed successfully, use `proot-distro login ubuntu` to log in to Ubuntu.
Then build ncnn; there is no need to install any other dependencies.
```
git clone https://github.com/Tencent/ncnn.git
cd ncnn
git submodule update --init
mkdir -p build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -DNCNN_BUILD_EXAMPLES=ON -DNCNN_PLATFORM_API=OFF -DNCNN_SIMPLEOCV=ON ..
make -j$(nproc)
```
Then you can run a test
> on my Pixel 3 XL with a Qualcomm 845, `256-ncnn.png` cannot be loaded
```
cd ../examples
../build/examples/squeezenet ../images/128-ncnn.png
```


@ -0,0 +1,172 @@
### caffemodel should be row-major
`caffe2ncnn` tool assumes the caffemodel is row-major (produced by c++ caffe train command).
The kernel 3x3 weights should be stored as
```
a b c
d e f
g h i
```
However, matlab caffe produces col-major caffemodels.
You have to transpose all the kernel weights yourself or re-train using the c++ caffe train command.
Besides, you may be interested in https://github.com/conanhujinming/matcaffe2caffe
### check input is RGB or BGR
If your caffemodel is trained using c++ caffe and opencv, then the input image should be BGR order.
If your model is trained using matlab caffe or pytorch or mxnet or tensorflow, the input image would probably be RGB order.
The channel order can be changed on-the-fly through proper pixel type enum
```
// construct RGB blob from rgb image
ncnn::Mat in_rgb = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB, w, h);
// construct BGR blob from bgr image
ncnn::Mat in_bgr = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR, w, h);
// construct BGR blob from rgb image
ncnn::Mat in_bgr = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB2BGR, w, h);
// construct RGB blob from bgr image
ncnn::Mat in_rgb = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR2RGB, w, h);
```
### image decoding
JPEG (`.jpg`, `.jpeg`) is a lossy compression format, so people may get different pixel values for the same image at the same position.
`.bmp` images are recommended instead.
### interpolation / resizing
There are several image resizing methods, which may generate different results for the same input image.
Even if we specify the same interpolation method, different frameworks/libraries and their various versions may also introduce differences.
A good practice is to feed an image of the same size as the input layer expects, e.g. read a 224x224 bmp image when the input layer needs a 224x224 size.
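A minimal sketch that resizes while constructing the blob, assuming `bgr` is a continuous BGR image and the network expects 224x224 input:
```cpp
ncnn::Mat in = ncnn::Mat::from_pixels_resize(bgr.data, ncnn::Mat::PIXEL_BGR,
                                             bgr.cols, bgr.rows, 224, 224);
```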
### Mat::from_pixels/from_pixels_resize assume that the pixel data is continuous
You shall pass continuous pixel buffer to from_pixels family.
If your image is an opencv submat from an image roi, call clone() to get a continuous one.
```
cv::Mat image;// the image
cv::Rect facerect;// the face rectangle
cv::Mat faceimage = image(facerect).clone();// get a continuous sub image
ncnn::Mat in = ncnn::Mat::from_pixels(faceimage.data, ncnn::Mat::PIXEL_BGR, faceimage.cols, faceimage.rows);
```
### pre process
Apply pre process according to your training configuration
Different models have different pre-process configs; you may find the following transform config in the Data layer section
```
transform_param {
mean_value: 103.94
mean_value: 116.78
mean_value: 123.68
scale: 0.017
}
```
Then the corresponding code for ncnn pre process is
```cpp
const float mean_vals[3] = { 103.94f, 116.78f, 123.68f };
const float norm_vals[3] = { 0.017f, 0.017f, 0.017f };
in.substract_mean_normalize(mean_vals, norm_vals);
```
Mean file is not currently supported,
so you have to pre-process the input data yourself (using opencv or something similar)
```
transform_param {
mean_file: "imagenet_mean.binaryproto"
}
```
For pytorch or mxnet-gluon
```python
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
```
Then the corresponding code for ncnn pre process is
```cpp
// R' = (R / 255 - 0.485) / 0.229 = (R - 0.485 * 255) / 0.229 / 255
// G' = (G / 255 - 0.456) / 0.224 = (G - 0.456 * 255) / 0.224 / 255
// B' = (B / 255 - 0.406) / 0.225 = (B - 0.406 * 255) / 0.225 / 255
const float mean_vals[3] = {0.485f*255.f, 0.456f*255.f, 0.406f*255.f};
const float norm_vals[3] = {1/0.229f/255.f, 1/0.224f/255.f, 1/0.225f/255.f};
in.substract_mean_normalize(mean_vals, norm_vals);
```
### use the desired blob
The blob names for input and extract differ among models.
For example, squeezenet v1.1 use "data" as input blob and "prob" as output blob while mobilenet-ssd use "data" as input blob and "detection_out" as output blob.
Some models may need multiple input or produce multiple output.
```cpp
ncnn::Extractor ex = net.create_extractor();
ex.input("data", in);// change "data" to yours
ex.input("mask", mask);// change "mask" to yours
ex.extract("output1", out1);// change "output1" to yours
ex.extract("output2", out2);// change "output2" to yours
```
### blob may have channel gap
Each channel pointer is aligned by 128bit in ncnn Mat structure.
blobs may have gaps between channels if (width x height) cannot be divided exactly by 4
Prefer using ncnn::Mat::from_pixels or ncnn::Mat::from_pixels_resize for constructing input blob from image data
If you do need a continuous blob buffer, reshape the output.
```cpp
// out is the output blob extracted
ncnn::Mat flattened_out = out.reshape(out.w * out.h * out.c);
// plain array, C-H-W
const float* outptr = flattened_out;
```
### create new Extractor for each image
The `ncnn::Extractor` object is stateful; if you reuse it for different inputs, you will always get exactly the same result cached inside.
Always create new Extractor to process images in loop unless you do know how the stateful Extractor works.
```cpp
for (int i=0; i<count; i++)
{
// always create Extractor
// it's cheap and almost instantly !
ncnn::Extractor ex = net.create_extractor();
// use
ex.input(your_data[i]);
}
```
### use proper loading api
If you want to load a plain param file buffer, you shall use Net::load_param_mem instead of Net::load_param.
For more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)
```cpp
ncnn::Net net;
// param_buffer is the content buffer of the XYZ.param file
net.load_param_mem(param_buffer);
```


@ -0,0 +1,73 @@
### Linux 编译 `caffe2ncnn` 时报 `Protobuf not found`
一般是因为 protobuf 未安装或环境变量未设置
1. 安装 protobuf
Ubuntu 系统尝试以下命令
> sudo apt-get install libprotobuf-dev protobuf-compiler
CentOS 尝试
> sudo yum install protobuf-devel.x86_64 protobuf-compiler.x86_64
2. 然后设置 C++ 环境
打开`~/.bashrc`,在末尾增加
> export LD_LIBRARY_PATH=${YOUR_PROTOBUF_LIB_PATH}:$LD_LIBRARY_PATH
3. 让配置生效
> source ~/.bashrc
### 编译 `caffe2ncnn` 时报 protoc 和 protobuf.so 版本不匹配
一般是因为系统安装了不止一个 protobuf。
#### 直接改链接路径
1. 先看 protoc 需要的 so 版本号
> ldd \`whereis protoc| awk '{print $2}'\` | grep libprotobuf.so
例如是 libprotobuf.so.10
2. 然后搜这个文件所在的路径
> cd / && find . -type f | grep libprotobuf.so.10
假设在`/home/user/mydir`
3. 设置 protobuf.so 的搜索目录
打开`~/.bashrc`,在末尾增加
> export LD_LIBRARY_PATH=/home/user/mydir:$LD_LIBRARY_PATH
4. 让配置生效
> source ~/.bashrc
#### 如果以上办法不行的话,尝试源码安装 protobuf
1. 首先在 [protobuf/releases](https://github.com/protocolbuffers/protobuf/releases/tag/v3.10.0) 下载所需的 pb 版本,例如需要 v3.10.0 。注意要下载 -cpp 后缀的压缩包。
2. 解压到某一目录,然后编译
> tar xvf protobuf-cpp-3.10.0.tar.gz && cd protobuf-3.10.0/
./configure --prefix=/your_install_dir && make -j 3 && make install
3. **不不不要**忽略`--prefix`直接安装到系统目录,源码编译好的 so 和头文件在`your_install_dir`
4. 设置 protobuf.so 的搜索目录
打开`~/.bashrc`,在末尾增加
```bash
export LD_LIBRARY_PATH=/your_install_dir/lib:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=/your_install_dir/include:$CPLUS_INCLUDE_PATH
```
5. 让配置生效
> source ~/.bashrc
#### If that still does not work
Try removing the existing protobuf (be careful not to remove the one shipped with the system; beginners should proceed with caution), then reinstall the required so with
> sudo apt-get install --reinstall libprotobuf8
Change the version number to match your own.
### On Windows, for this kind of problem the basic approach is also to adjust the environment variables in the IDE
### Essential survival skills
For environment variable setup, tools and tricks, https://missing.csail.mit.edu/ is highly recommended.

View File

@ -0,0 +1,129 @@
### param is too old, please regenerate
Your model file is in the old format converted by an old caffe2ncnn tool.
Checkout the latest ncnn code, build it and regenerate the param and model binary files, and that should work.
Make sure that your param file starts with the magic number 7767517.
You may find more info in [use-ncnn-with-alexnet](use-ncnn-with-alexnet)
### find_blob_index_by_name XYZ failed
That means ncnn couldn't find the XYZ blob in the network.
You shall call Extractor::input()/extract() by blob name instead of layer name.
For models loaded from binary param file or external memory, you shall call Extractor::input()/extract() by the enum defined in xxx.id.h because all the visible string literals have been stripped in binary form.
This error usually happens when the input layer is not properly converted.
You shall upgrade the caffe prototxt/caffemodel before converting it to ncnn. An Input layer like the following snippet shall be ok.
```
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```
You may find more info in [use-ncnn-with-alexnet](use-ncnn-with-alexnet).
### layer XYZ not exists or registered
Your network contains some operations that are not implemented in ncnn.
You may implement them as custom layers by following [how-to-implement-custom-layer-step-by-step](how-to-implement-custom-layer-step-by-step).
Or you could simply register them as no-op if you are sure those operations make no sense.
```cpp
class Noop : public ncnn::Layer {};
DEFINE_LAYER_CREATOR(Noop)
net.register_custom_layer("LinearRegressionOutput", Noop_layer_creator);
net.register_custom_layer("MAERegressionOutput", Noop_layer_creator);
```
### fopen XYZ.param/XYZ.bin failed
File not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.
### network graph not ready
You shall call Net::load_param() first, then Net::load_model().
This error may also happen when Net::load_param() fails but is not handled properly.
For more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)
### memory not 32-bit aligned at XYZ
The pointer passed to Net::load_param() or Net::load_model() is not 32bit aligned.
In practice, the head pointer of std::vector<unsigned char> is not guaranteed to be 32bit aligned.
You can store your binary buffer in the ncnn::Mat structure; its internal memory is aligned.
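For example (a sketch; `file_size` is assumed to hold the byte length of the binary param file, and the content is assumed to be read into the buffer elsewhere):
```cpp
// ncnn::Mat allocations are aligned, so a Mat makes a safe staging buffer
ncnn::Mat param_data((int)file_size, (size_t)1u);
// ... read the XYZ.param.bin content into param_data.data ...
int consumed = net.load_param((const unsigned char*)param_data.data);
```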
### undefined reference to '__kmpc_XYZ_XYZ'
Use clang for building the android shared library and comment out the following line in your Application.mk:
```
NDK_TOOLCHAIN_VERSION := 4.9
```
### crash on android with '__kmp_abort_process'
This usually happens if you bundle multiple shared libraries with openmp linked.
It is actually an issue of the android ndk https://github.com/android/ndk/issues/1028
On old android ndk, modify the link flags as
```
-Wl,-Bstatic -lomp -Wl,-Bdynamic
```
For recent ndk >= 21
```
-fstatic-openmp
```
### dlopen failed: library "libomp.so" not found
Newer android ndk defaults to the dynamic openmp runtime.
Modify the link flags as
```
-fstatic-openmp -fopenmp
```
### crash when freeing a ncnn dynamic library(*.dll/*.so) built with openMP
For optimal performance, the openmp threadpool spin-waits for about a second before shutting down, in case more work becomes available.
If you unload a dynamic library that's in the process of spin-waiting, it will crash in the manner you see (most of the time).
Just set OMP_WAIT_POLICY=passive in your environment before calling LoadLibrary, or just wait a few seconds before calling FreeLibrary.
You can also use the following method to set environment variables in your code:
for msvc++:
```
SetEnvironmentVariable(_T("OMP_WAIT_POLICY"), _T("passive"));
```
for g++:
```
setenv("OMP_WAIT_POLICY", "passive", 1);
```
reference: https://stackoverflow.com/questions/34439956/vc-crash-when-freeing-a-dll-built-with-openmp

View File

@ -0,0 +1,124 @@
### how to enable ncnn vulkan capability
Follow [the build and install instruction](https://github.com/Tencent/ncnn/blob/master/docs/how-to-build/how-to-build.md)
Make sure you have installed the vulkan sdk from the [lunarg vulkan sdk website](https://vulkan.lunarg.com/sdk/home)
Usually, you can enable the vulkan compute inference feature by adding only one line of code to your application.
```cpp
// enable vulkan compute feature before loading
ncnn::Net net;
net.opt.use_vulkan_compute = 1;
```
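For completeness, a minimal end-to-end sketch (the blob names and input size are placeholders for your own model):
```cpp
ncnn::Net net;
net.opt.use_vulkan_compute = 1; // must be set before loading

net.load_param("model.param");
net.load_model("model.bin");

ncnn::Mat in(224, 224, 3); // your preprocessed input
ncnn::Mat out;

ncnn::Extractor ex = net.create_extractor();
ex.input("data", in);      // placeholder blob name
ex.extract("output", out); // placeholder blob name
```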
### does my graphics device support vulkan
Some platforms have been tested and are known to work. In theory, if your platform supports the vulkan api, either 1.0 or 1.1, it shall work.
* Y = known work
* ? = shall work, not confirmed
* / = not applied
| |windows|linux|android|mac|ios|
|---|---|---|---|---|---|
|intel|Y|Y|?|?|/|
|amd|Y|Y|/|?|/|
|nvidia|Y|Y|?|/|/|
|qcom|/|/|Y|/|/|
|apple|/|/|/|Y|Y|
|arm|/|?|Y|/|/|
You can search [the vulkan database](https://vulkan.gpuinfo.org) to see if your device supports vulkan.
Some old buggy drivers may produce wrong results; they are blacklisted in ncnn and treated as non-vulkan-capable devices.
You could check if your device and driver have this issue with [my conformance test here](vulkan-conformance-test).
Most of these are android systems with a version lower than 8.1.
### why using vulkan over cuda/opencl/metal
In the beginning, I had no GPGPU programming experience, and I had to learn one.
vulkan is considered more portable, well supported by vendors, and is a cross-platform low-overhead graphics api. In contrast, cuda is only available on nvidia devices, metal is only available on macos and ios, and loading the opencl library is banned in android 7.0+ and does not work on ios.
### I got errors like "vkCreateComputePipelines failed -1000012000" or random stalls or crashes
Upgrade your vulkan driver.
[intel https://downloadcenter.intel.com/product/80939/Graphics-Drivers](https://downloadcenter.intel.com/product/80939/Graphics-Drivers)
[amd https://www.amd.com/en/support](https://www.amd.com/en/support)
[nvidia https://www.nvidia.com/Download/index.aspx](https://www.nvidia.com/Download/index.aspx)
### how to use ncnn vulkan on android
minimum android ndk version: android-ndk-r18b
minimum sdk platform api version: android-24
link your jni project with libvulkan.so
[The squeezencnn example](https://github.com/Tencent/ncnn/tree/master/examples/squeezencnn) is equipped with gpu inference; you could take it as a reference.
### how to use ncnn vulkan on ios
setup vulkan sdk (https://vulkan.lunarg.com/sdk/home#mac)
Metal only works on real devices with an arm64 cpu (iPhone 5s and later)
link your project with MoltenVK framework and Metal
### what about the layers without vulkan support
These layers have vulkan support currently
AbsVal, BatchNorm, BinaryOp, Cast, Clip, Concat, Convolution, ConvolutionDepthWise, Crop, Deconvolution, DeconvolutionDepthWise, Dropout, Eltwise, Flatten, HardSigmoid, InnerProduct, Interp, LRN, Packing, Padding, Permute, Pooling(pad SAME not supported), PReLU, PriorBox, ReLU, Reorg, Reshape, Scale, ShuffleChannel, Sigmoid, Softmax, TanH, UnaryOp
For layers without vulkan support, the ncnn inference engine will automatically fall back to the cpu path.
Thus, it is usually not a serious issue if your network only has some special head layers like SSD or YOLO. All examples in ncnn are known working properly with vulkan enabled.
### my model runs slower on gpu than cpu
The current vulkan inference implementation is far from its preferred state. Many handy optimization techniques are planned, such as winograd convolution, operator fusion, fp16 storage and arithmetic, etc.
It is common that your model runs slower on gpu than cpu on arm devices like mobile phones, since we have quite good arm optimization in ncnn ;)
### vulkan device not found / extra high cpu utility while vulkan is enabled on nvidia gpu
There are several reasons that could lead to this outcome. First, please check your driver status with `nvidia-smi`. If you have installed your driver correctly, you should see something like this:
```bash
$ nvidia-smi
Sat Mar 06 19:53:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48 Driver Version: 451.48 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 WDDM | 00000000:02:00.0 Off | N/A |
| N/A 31C P8 5W / N/A | 90MiB / 6144MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
```
If `nvidia-smi` crashes or cannot be found, please reinstall your graphics driver.
If ncnn *is* utilizing the Tesla GPU, you can see your program in the `Processes` block at the bottom. In that case, it's likely that some operators are not yet supported in Vulkan and have fallen back to the CPU, leading to low GPU utilization.
If you *couldn't* find your process running, please check the active driver model, which can be found to the right of your device name. For Geforce and Titan GPUs, the default driver model is WDDM (Windows Desktop Driver Model), which supports both rendering graphics and computing. But for Tesla GPUs, without configuration, the driver model defaults to TCC ([Tesla Computing Cluster](https://docs.nvidia.com/gameworks/content/developertools/desktop/tesla_compute_cluster.htm)). NVIDIA's TCC driver does not support Vulkan, so you need to use the following command to set the driver model back to WDDM in order to use Vulkan:
```bash
$ nvidia-smi -g 0 -dm 0
```
The number following `-g` is the GPU ID (which can be found to the left of your device name in `nvidia-smi` output); and `-dm` stands for driver model, 0 refers to WDDM and 1 means TCC.

View File

@ -0,0 +1,136 @@
If, for some reason, you're not happy with the binary size of the ncnn library, here is the cheatsheet that helps you build a minimal ncnn :P
### disable c++ rtti and exceptions
```
cmake -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON ..
```
* Cannot use RTTI and Exceptions when ncnn functions are called.
### disable vulkan support
```
cmake -DNCNN_VULKAN=OFF ..
```
* Cannot use GPU acceleration.
### disable NCNN_STDIO
```
cmake -DNCNN_STDIO=OFF ..
```
* Cannot load models from files, but can load models from memory or from Android Assets.
Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#load-model).
### disable NCNN_STRING
```
cmake -DNCNN_STRING=OFF ..
```
* Cannot load human-readable param files with visible strings, but can load binary param.bin files.
Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#strip-visible-string)
* Cannot identify blobs by string name when calling `Extractor::input / extract`, but can identify them by enum value in `id.h`.
Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#input-and-output).
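For reference, addressing blobs by the generated enum values looks like this (a short sketch reusing the alexnet ids produced by ncnn2mem, as shown in the alexnet guide later in these docs):
```cpp
// a minimal sketch: with NCNN_STRING disabled, use the enum ids from the generated xxx.id.h
#include "alexnet.id.h"

ncnn::Extractor ex = net.create_extractor();
ex.input(alexnet_param_id::BLOB_data, in);
ex.extract(alexnet_param_id::BLOB_prob, out);
```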
### disable NCNN_BF16
```
cmake -DNCNN_BF16=OFF ..
```
* Cannot use bf16 storage type in inference.
### disable NCNN_INT8
```
cmake -DNCNN_INT8=OFF ..
```
* Cannot use quantized int8 inference.
### drop pixel drawing functions
```
cmake -DNCNN_PIXEL_DRAWING=OFF ..
```
* Cannot use the functions for drawing basic shapes and text like `ncnn::draw_rectangle_xx / ncnn::draw_circle_xx / ncnn::draw_text_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available.
### drop pixel rotate and affine functions
```
cmake -DNCNN_PIXEL_ROTATE=OFF -DNCNN_PIXEL_AFFINE=OFF ..
```
* Cannot use the functions for rotation and affine transformation like `ncnn::kanna_rotate_xx / ncnn::warpaffine_bilinear_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available.
### drop pixel functions
```
cmake -DNCNN_PIXEL=OFF ..
```
* Cannot use the functions converting between images and pixels like `Mat::from_pixels / from_pixels_resize / to_pixels / to_pixels_resize`; you need to create a Mat and fill in the data by hand, as sketched below.
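Creating the Mat and filling it by hand could look like this (a sketch; `my_plane_data` stands for whatever CHW float planes you already have):
```cpp
// build a w x h x 3 input Mat manually when the pixel helpers are compiled out
ncnn::Mat in(w, h, 3);
for (int q = 0; q < in.c; q++)
{
    float* ptr = in.channel(q);
    for (int i = 0; i < w * h; i++)
    {
        ptr[i] = my_plane_data[q][i]; // hypothetical preprocessed source data
    }
}
```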
### disable openmp
```
cmake -DNCNN_OPENMP=OFF ..
```
* Cannot use openmp multi-threading acceleration. If you want to run a model in a single thread on your target machine, it is recommended to disable this option.
### disable avx2 and arm82 optimized kernel
```
cmake -DNCNN_AVX2=OFF -DNCNN_ARM82=OFF ..
```
* Do not compile the optimized kernels that use the avx2 / arm82 instruction set extensions. If your target machine does not support some of them, it is recommended to disable the related options.
### disable runtime cpu instruction dispatch
```
cmake -DNCNN_RUNTIME_CPU=OFF ..
```
* Cannot check supported cpu instruction set extensions and use related optimized kernels in runtime.
* If you know which instruction set extensions are supported on your target machine, such as avx2 / arm82, you can enable the related options like `-DNCNN_AVX2=ON / -DNCNN_ARM82=ON` by hand, and then the sse2 / arm8 version kernels will not be compiled.
### drop layers not used
```
cmake -DWITH_LAYER_absval=OFF -DWITH_LAYER_bnll=OFF ..
```
* If your model does not include some layers, taking absval / bnll as examples above, you can drop them.
* Some key or dependency layers should not be dropped, like convolution / innerproduct, their dependencies like padding / flatten, and activations like relu / clip.
### disable c++ stl
```
cmake -DNCNN_SIMPLESTL=ON ..
```
* The STL provided by the compiler is no longer depended on; the `simplestl` provided by ncnn is used as a replacement. User code that calls ncnn functions is then limited to `simplestl` as well.
* Usually used together with the compiler parameters `-nodefaultlibs -fno-builtin -nostdinc++ -lc`
* The cmake parameters `cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_STL=system` are needed to avoid an STL conflict when compiling for Android.
### drop optimized kernel not used
* Modify the source code under `ncnn/src/layer/arm/` to delete unnecessary optimized kernels or replace them with empty functions.
* You can also drop layers and related optimized kernels by `-DWITH_LAYER_absval=OFF` as mentioned above.
### drop operators from BinaryOp UnaryOp
* Modify `ncnn/src/layer/binaryop.cpp unaryop.cpp` and `ncnn/src/layer/arm/binaryop_arm.cpp unaryop_arm.cpp` by hand to delete unnecessary operators.

View File

@ -0,0 +1,162 @@
### image roi crop + convert to ncnn::Mat
```
+--------------+
| y | /-------/
| x +-------+ | +-------+|
| | roih |im_h => | roih
| +-roiw--+ | +-roiw--+/
| |
+-----im_w-----+
```
```cpp
ncnn::Mat in = ncnn::Mat::from_pixels_roi(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih);
```
For an Android application, it is:
```cpp
ncnn::Mat in = ncnn::Mat::from_android_bitmap_roi(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih);
```
### image roi crop + resize + convert to ncnn::Mat
```
+--------------+
| y | /----/
| x +-------+ | +----+|
| | roih |im_h => | target_h
| +-roiw--+ | | ||
| | +----+/
+-----im_w-----+ target_w
```
```cpp
ncnn::Mat in = ncnn::Mat::from_pixels_roi_resize(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih, target_w, target_h);
```
For an Android application, it is:
```cpp
ncnn::Mat in = ncnn::Mat::from_android_bitmap_roi_resize(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih, target_w, target_h);
```
### ncnn::Mat export image + offset paste
```
+--------------+
/-------/ | y |
+-------+| | x +-------+ |
| h| => | | h |im_h
+---w---+/ | +---w---+ |
| |
+-----im_w-----+
```
```cpp
unsigned char* data = im.data + (y * im_w + x) * 3;
out.to_pixels(data, ncnn::Mat::PIXEL_RGB, im_w * 3);
```
### ncnn::Mat export image + resize + roi paste
```
+--------------+
/----/ | y |
+----+| | x +-------+ |
| h| => | | roih|im_h
| || | +-roiw--+ |
+-w--+/ | |
+-----im_w-----+
```
```cpp
unsigned char* data = im.data + (y * im_w + x) * 3;
out.to_pixels_resize(data, ncnn::Mat::PIXEL_RGB, roiw, roih, im_w * 3);
```
### image roi crop + resize
```
+--------------+
| y |
| x +-------+ | +----+
| | roih|im_h => | target_h
| +-roiw--+ | | |
| | +----+
+-----im_w-----+ target_w
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
ncnn::resize_bilinear_c3(data, roiw, roih, im_w * 3, outdata, target_w, target_h, target_w * 3);
```
### image resize + offset paste
```
+--------------+
| y |
+----+ | x +-------+ |
| h => | | roih |im_h
| | | +-roiw--+ |
+-w--+ | |
+-----im_w-----+
```
```cpp
unsigned char* outdata = im.data + (y * im_w + x) * 3;
ncnn::resize_bilinear_c3(data, w, h, w * 3, outdata, roiw, roih, im_w * 3);
```
### image roi crop + resize + roi paste
```
+--------------+ +-----------------+
| y | | roiy |
| x +-------+ | |roix----------+ |
| | h |im_h => | | target_h|outim_h
| +---w---+ | | | | |
| | | +-target_w-+ |
+-----im_w-----+ +-----outim_w-----+
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
unsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;
ncnn::resize_bilinear_c3(data, w, h, im_w * 3, outdata, target_w, target_h, outim_w * 3);
```
### image roi crop + rotate
```
+--------------+
| y |
| x +-------+ | +---+
| | < < h |im_h => | ^ |w
| +---w---+ | | ^ |
| | +---+
+-----im_w-----+ h
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
ncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, h * 3, 6);
```
### image rotate + offset paste
```
+--------------+
| y |
+---+ | x +-------+ |
| ^ |h => | | < < w |im_h
| ^ | | +---h---+ |
+---+ | |
w +-----im_w-----+
```
```cpp
unsigned char* outdata = im.data + (y * im_w + x) * 3;
ncnn::kanna_rotate_c3(data, w, h, w * 3, outdata, h, w, im_w * 3, 7);
```
### image roi crop + rotate + roi paste
```
+--------------+ +-----------------+
| y | | roiy |
| x +-------+ | | roix +---+ |
| | < < h |im_h => | | ^ w |outim_h
| +---w---+ | | | ^ | |
| | | +-h-+ |
+-----im_w-----+ +-----outim_w-----+
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
unsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;
ncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, outim_w * 3, 6);
```

View File

@ -0,0 +1,26 @@
### the comprehensive model loading api table
|load from|alexnet.param|alexnet.param.bin|alexnet.bin|
|---|---|---|---|
|file path|load_param(const char*)|load_param_bin(const char*)|load_model(const char*)|
|file descriptor|load_param(FILE*)|load_param_bin(FILE*)|load_model(FILE*)|
|file memory|load_param_mem(const char*)|load_param(const unsigned char*)|load_model(const unsigned char*)|
|android asset|load_param(AAsset*)|load_param_bin(AAsset*)|load_model(AAsset*)|
|android asset path|load_param(AAssetManager*, const char*)|load_param_bin(AAssetManager*, const char*)|load_model(AAssetManager*, const char*)|
|custom IO reader|load_param(const DataReader&)|load_param_bin(const DataReader&)|load_model(const DataReader&)|
### points to note
1. Either of the following combinations shall be enough for loading a model
* alexnet.param + alexnet.bin
* alexnet.param.bin + alexnet.bin
2. Never modify the Net opt member after loading
3. Most loading functions return 0 on success, except loading alexnet.param.bin and alexnet.bin from file memory, which returns the number of bytes consumed
* int Net::load_param(const unsigned char*)
* int Net::load_model(const unsigned char*)
4. On Android, it is recommended to load models from Android assets directly to avoid copying them to the sdcard
5. The custom IO reader interface can be used to implement on-the-fly model decryption and loading, as sketched below
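A minimal sketch of such a custom reader (the `xor_decrypt()` helper is hypothetical; the scan()/read() overrides follow the ncnn::DataReader interface):
```cpp
#include "net.h"
#include "datareader.h"
#include <stdio.h>

class EncryptedFileReader : public ncnn::DataReader
{
public:
    explicit EncryptedFileReader(FILE* fp) : fp(fp) {}

    // used for parsing plain text param files
    virtual int scan(const char* format, void* p) const
    {
        return fscanf(fp, format, p);
    }

    // used for binary param.bin and model weights
    virtual size_t read(void* buf, size_t size) const
    {
        size_t nread = fread(buf, 1, size, fp);
        // xor_decrypt(buf, nread); // hypothetical in-place decryption
        return nread;
    }

private:
    FILE* fp;
};
```
Then pass the same reader to `net.load_param(dr)` and `net.load_model(dr)`.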

View File

@ -0,0 +1,74 @@
ncnn openmp best practice
### CPU load average is too high with ncnn
When running neural network inference with ncnn, cpu occupancy is very high, with all CPU cores close to 100%.
If there are other threads or processes that need more cpu resources, the running speed of the program drops severely.
### The root cause of high CPU usage
1. ncnn uses the openmp API to speed up inference. By default the thread count equals the cpu core count. If the computation needs to run frequently, it will consume a lot of cpu resources.
2. There is a thread pool managed by openmp; the pool size equals the cpu core count (capped at 15 when there are many more cpu cores?).
Openmp needs to synchronize threads when acquiring and returning them to the pool. To improve efficiency, almost all omp implementations use spinlock synchronization (except simpleomp).
The default spin time of the spinlock is 200ms, so after a thread is scheduled it may busy-wait for up to 200ms.
### Why CPU usage is still high even when using vulkan GPU acceleration
1. Openmp is also used when loading the param and bin files, and this part runs on the cpu.
2. The fp32/fp16 conversion before GPU memory upload and after download is executed on the cpu, and this part of the logic also uses openmp.
### Solution
```
1. Bind to the specific cpu core.
```
If you use a device with big and little CPU cores, it is recommended to bind to the big or little cores through ncnn::set_cpu_powersave(int). Note that Windows does not support binding cores. By the way, it is possible to have multiple thread pools with openmp; a new thread pool is created for each new thread scope.
Suppose your platform has 2 big cores + 4 little cores, and you want to execute model A on the 2 big cores and model B on the 4 little cores concurrently.
Create two threads via std::thread or pthread:
```
void thread_1()
{
ncnn::set_cpu_powersave(2); // bind to big cores
netA.opt.num_threads = 2;
}
void thread_2()
{
ncnn::set_cpu_powersave(1); // bind to little cores
netB.opt.num_threads = 4;
}
```
```
2. Use fewer threads.
```
Set the number of threads to half the cpu core count or less through ncnn::set_omp_num_threads(int) or by changing the net.opt.num_threads field. If you are using clang's libomp, it is recommended that the number of threads does not exceed 8; if you use other omp libraries, it is recommended that it does not exceed 4.
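For example (the numbers here are only illustrative):
```cpp
// a minimal sketch: cap ncnn to 2 threads
ncnn::set_omp_num_threads(2); // process-wide default
net.opt.num_threads = 2;      // per-net setting
```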
```
3. Reduce openmp spinlock blocktime.
```
You can modify the openmp blocktime by calling ncnn::set_kmp_blocktime(int) or by modifying the net.opt.openmp_blocktime field.
This argument is the spin time set by the ncnn API; the default is 20ms. You can set a smaller value according to your situation, or change it directly to 0.
Limitations: at present only clang's libomp implements this; neither vcomp nor libgomp has a corresponding interface.
If ncnn is not compiled with clang, this value still defaults to 200ms.
If you use vcomp or libgomp, you can use the environment variable OMP_WAIT_POLICY=PASSIVE to disable the spin time. If you use simpleomp, there is no need to set this parameter.
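For example:
```cpp
// a minimal sketch: shrink or disable the openmp spin time
ncnn::set_kmp_blocktime(0);   // process-wide
net.opt.openmp_blocktime = 0; // per-net setting
```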
```
4. Limit the number of threads available in the openmp thread pool.
```
Even if the number of openmp threads is reduced, the CPU occupancy rate may still be high. This is more common on servers with particularly many CPU cores.
It happens because the waiting threads in the thread pool busy-wait on a spinlock, which can be mitigated by limiting the number of threads available in the thread pool.
Generally, you can set the OMP_THREAD_LIMIT environment variable. simpleomp currently does not support this feature, so there is no need to set it there.
Note that this environment variable only takes effect if it is set before the program starts.
```
5. Disable openmp completely
```
If there is only one cpu core, or you use vulkan gpu acceleration, it is recommended to disable openmp entirely; just specify -DNCNN_OPENMP=OFF when compiling with cmake.

View File

@ -0,0 +1,70 @@
ncnn openmp best practice
### ncnn takes up too many cpu resources
When running inference with ncnn, cpu occupancy is very high, with every core close to 100%.
If other threads or processes need a significant share of the cpu, the program slows down severely.
### The root cause of high cpu usage
1. ncnn uses the openmp API to control multi-threaded acceleration of inference. By default the thread count equals the number of cpu cores. If inference needs to run at high frequency, it inevitably takes up most of the cpu resources.
2. openmp maintains an internal thread pool whose maximum size equals the number of cpu cores (capped at 15 when there are many cores). Acquiring and returning threads requires synchronization.
To improve efficiency, almost all omp implementations use spinlock synchronization (except simpleomp). The default spin time of the spinlock is 200ms, so after a thread is scheduled it may busy-wait for up to 200ms.
### Why cpu usage is still high even with vulkan acceleration
1. openmp is also used when loading the parameter files, and that part runs on the cpu.
2. The fp32/fp16 conversion before GPU memory upload and after download is executed on the cpu, and that logic also uses openmp.
### Solutions
```
1. Bind to specific cpu cores
```
If you use a device with big and little CPU cores, it is recommended to bind to the big or little cores through ncnn::set_cpu_powersave(int). Note that Windows does not support binding cores. By the way, ncnn supports running different models on different cores. Suppose the hardware platform has 2 big cores and 4 little cores, and you want to run netA on the big cores and netB on the little cores.
Create two threads via std::thread or pthread to run the following code:
```
void thread_1()
{
ncnn::set_cpu_powersave(2); // bind to big cores
netA.opt.num_threads = 2;
}
void thread_2()
{
ncnn::set_cpu_powersave(1); // bind to little cores
netB.opt.num_threads = 4;
}
```
```
2. Use fewer threads
```
Set the number of threads to half the number of cpu cores or less through ncnn::set_omp_num_threads(int) or the net.opt.num_threads field. If you use clang's libomp, it is recommended not to exceed 8 threads; with other omp libraries, it is recommended not to exceed 4.
```
3. Reduce the openmp blocktime
```
You can call ncnn::set_kmp_blocktime(int) or modify net.opt.openmp_blocktime. This parameter is the spin time set by the ncnn API and defaults to 20ms.
You can set a smaller value as needed, or change it directly to 0.
Limitation: at present only clang's libomp implements this; vcomp and libgomp have no corresponding interface, and if ncnn is not built with clang this value still defaults to 200ms.
If you use vcomp or libgomp, you can set the environment variable OMP_WAIT_POLICY=PASSIVE to disable the spin time; with simpleomp this parameter does not need to be set.
```
4. Limit the number of threads available in the openmp thread pool
```
Even with fewer openmp threads, cpu usage may still be high. This is more common on servers with particularly many cpu cores, because the waiting threads in the pool busy-wait on a spinlock; limiting the number of threads available in the pool mitigates this.
Generally you can set the OMP_THREAD_LIMIT environment variable. simpleomp currently does not support this feature, so it does not need to be set there. Note that this environment variable only takes effect if it is set before the program starts.
```
5. Disable openmp completely
```
If there is only one cpu core, or vulkan gpu acceleration is used, it is recommended to disable openmp; just specify -DNCNN_OPENMP=OFF when configuring with cmake.

View File

@ -0,0 +1,71 @@
# Post Training Quantization Tools
To support int8 model deployment on mobile devices, we provide universal post training quantization tools which can convert a float32 model to an int8 model.
## User Guide
Take mobilenet as an example; only three steps are needed.
### 1. Optimize model
```shell
./ncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 0
```
### 2. Create the calibration table file
We suggest using the validation dataset for calibration, with more than 5000 images.
Some imagenet sample images are available at https://github.com/nihui/imagenet-sample-images
```shell
find images/ -type f > imagelist.txt
./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist.txt mobilenet.table mean=[104,117,123] norm=[0.017,0.017,0.017] shape=[224,224,3] pixel=BGR thread=8 method=kl
```
* mean and norm are the values you passed to ```Mat::substract_mean_normalize()```
* shape is the blob shape of your model, [w,h] or [w,h,c]
>
* if both w and h are given, the image will be resized to exactly that size.
* if both w and h are zero or negative, the image will not be resized.
* if only h is zero or negative, the image's width will be resized to w, keeping the aspect ratio.
* if only w is zero or negative, the image's height will be resized to h, keeping the aspect ratio.
* pixel is the pixel format of your model, image pixels will be converted to this type before ```Extractor::input()```
* thread is the CPU thread count that could be used for parallel inference
* method is the post training quantization algorithm, kl and aciq are currently supported
If your model has multiple input nodes, you can use multiple list files and other parameters
```shell
./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist-bgr.txt,imagelist-depth.txt mobilenet.table mean=[104,117,123],[128] norm=[0.017,0.017,0.017],[0.0078125] shape=[224,224,3],[224,224,1] pixel=BGR,GRAY thread=8 method=kl
```
### 3. Quantize model
```shell
./ncnn2int8 mobilenet-opt.param mobilenet-opt.bin mobilenet-int8.param mobilenet-int8.bin mobilenet.table
```
## use ncnn int8 inference
The ncnn library will use int8 inference automatically; nothing changes in your code.
```cpp
ncnn::Net mobilenet;
mobilenet.load_param("mobilenet-int8.param");
mobilenet.load_model("mobilenet-int8.bin");
```
## mixed precision inference
Before quantizing your model, comment out the layer weight scale line in the table file; that layer will then run float32 inference.
```
conv1_param_0 156.639840536
```
```
#conv1_param_0 156.639840536
```

View File

@ -0,0 +1,162 @@
We use alexnet as an example
### prepare caffe prototxt and model
These files are usually generated when training with caffe
```
train.prototxt
deploy.prototxt
snapshot_10000.caffemodel
```
The deploy.prototxt and caffemodel files are enough for the TEST phase
alexnet deploy.prototxt can be downloaded here
https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
alexnet caffemodel can be downloaded here
http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel
### convert to ncnn model
Convert the old caffe prototxt and caffemodel to the new format using the tools shipped with caffe,
because the ncnn conversion tool only understands the new format
```
upgrade_net_proto_text [old prototxt] [new prototxt]
upgrade_net_proto_binary [old caffemodel] [new caffemodel]
```
Use an Input layer as the input, and set the N dim to 1 since only one image can be processed at a time
```
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```
Use the caffe2ncnn tool to convert the caffe model to an ncnn model
```
caffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin
```
### strip visible string
Deploying with only the param and bin files is already enough, but there are visible strings in the param file, and it may not be suitable to distribute plain neural network information in your APP.
You can use the ncnn2mem tool to convert the plain model file to a binary representation. It will generate alexnet.param.bin and two static array code files.
```
ncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h
```
### load model
Load param and bin file, the easy way
```cpp
ncnn::Net net;
net.load_param("alexnet.param");
net.load_model("alexnet.bin");
```
Load the binary param.bin and bin file, with no visible strings included, suitable for bundling as an APP resource
```cpp
ncnn::Net net;
net.load_param_bin("alexnet.param.bin");
net.load_model("alexnet.bin");
```
Load network and model from external memory, no visible strings included, no external resource files bundled, the whole model is hardcoded in your program
You may use this way to load from android asset resource
```cpp
#include "alexnet.mem.h"
ncnn::Net net;
net.load_param(alexnet_param_bin);
net.load_model(alexnet_bin);
```
You can choose any of these ways to load the model. Loading from external memory is zero-copy, which means you must keep the memory buffer alive during processing
### unload model
```cpp
net.clear();
```
### input and output
ncnn Mat is the data structure for input and output data
The input image should be converted to a Mat, with mean values subtracted and normalization applied when needed
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
ncnn::Mat in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);
const float mean_vals[3] = {104.f, 117.f, 123.f};
in.substract_mean_normalize(mean_vals, 0);
```
Execute the network inference and retrieve the result
```cpp
#include "net.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input("data", in);
ex.extract("prob", out);
```
If you load the model with the binary param.bin file, you should use the enum values in the alexnet.id.h file instead of the blob names
```cpp
#include "net.h"
#include "alexnet.id.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input(alexnet_param_id::BLOB_data, in);
ex.extract(alexnet_param_id::BLOB_prob, out);
```
Read the data in the output Mat. Iterate over the data to get all classification scores.
```cpp
ncnn::Mat out_flatterned = out.reshape(out.w * out.h * out.c);
std::vector<float> scores;
scores.resize(out_flatterned.w);
for (int j=0; j<out_flatterned.w; j++)
{
scores[j] = out_flatterned[j];
}
```
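To turn the scores into a prediction, you can simply take the index of the highest score, e.g.:
```cpp
// find the class with the best score
int top_class = 0;
for (size_t i = 1; i < scores.size(); i++)
{
    if (scores[i] > scores[top_class])
        top_class = (int)i;
}
```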
### some tricks
Set multithreading thread number with Extractor
```cpp
ex.set_num_threads(4);
```
Convert the image colorspace and resize the image with the convenient Mat functions; these functions are well optimized
They support RGB2GRAY, GRAY2RGB, RGB2BGR etc., and support both scaling up and scaling down
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
int target_width = 227;// target resized width
int target_height = 227;// target resized height
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);
```
You can concatenate multiple model files into one and load this single file through the FILE* interface.
This eases the distribution of the param and model files.
> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin
```cpp
#include "net.h"
FILE* fp = fopen("alexnet-all.bin", "rb");
net.load_param_bin(fp);
net.load_model(fp);
fclose(fp);
```

View File

@ -0,0 +1,149 @@
First of all, many thanks for your interest in the ncnn component.
To make the ncnn component easier to use, the author wrote this usage guide, using the ubiquitous alexnet as the example.
### Prepare the caffe network and model
Caffe networks and models are usually trained by deep learning researchers; after training you typically have
```
train.prototxt
deploy.prototxt
snapshot_10000.caffemodel
```
Only the TEST phase is needed for deployment, so deploy.prototxt and the caffemodel are enough
The alexnet deploy.prototxt can be downloaded here
https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet
The alexnet caffemodel can be downloaded here
http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel
### Convert to the ncnn network and model
Caffe ships with tools to convert old-format caffe networks and models to the new format; the ncnn conversion tool only understands the new format
```
upgrade_net_proto_text [old prototxt] [new prototxt]
upgrade_net_proto_binary [old caffemodel] [new caffemodel]
```
Change the input layer to Input; since only one image is processed at a time, set the first dim to 1
```
layer {
name: "data"
type: "Input"
top: "data"
input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```
Use the caffe2ncnn tool to convert to the ncnn network description and model
```
caffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin
```
### Strip visible strings
The param and bin files are actually already usable, but the param description file is plain text, and shipping it with an APP makes it easy for others to peek at the network structure (as if they could not see it anyway when it is not plain text).
Use the ncnn2mem tool to convert them to a binary description file and in-memory model, generating alexnet.param.bin and two static array code files
```
ncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h
```
### Load the model
Load the param and bin files directly; suitable for quickly validating results
```cpp
ncnn::Net net;
net.load_param("alexnet.param");
net.load_model("alexnet.bin");
```
Load the binary param.bin and bin; no visible strings, suitable for distributing model resources with an APP
```cpp
ncnn::Net net;
net.load_param_bin("alexnet.param.bin");
net.load_model("alexnet.bin");
```
Load the network and model from a memory reference; no visible strings, all the model data lives inside the code, with no external files at all
Also, resource files packed into an android apk are read out as memory blocks anyway
```cpp
#include "alexnet.mem.h"
ncnn::Net net;
net.load_param(alexnet_param_bin);
net.load_model(alexnet_bin);
```
All three ways above can load the model. Loading from a memory reference is zero-copy, so the source memory block must stay alive while the net is in use
### Unload the model
```cpp
net.clear();
```
### Input and output
ncnn uses its own data structure Mat to hold input and output data
The input image data has to be converted to a Mat, with the mean subtracted and scale factors applied as needed
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
ncnn::Mat in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);
const float mean_vals[3] = {104.f, 117.f, 123.f};
in.substract_mean_normalize(mean_vals, 0);
```
Run the forward network and obtain the result
```cpp
#include "net.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input("data", in);
ex.extract("prob", out);
```
If you use the binary param.bin, there are no visible strings; use the enums from alexnet.id.h in place of the blob names
```cpp
#include "net.h"
#include "alexnet.id.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input(alexnet_param_id::BLOB_data, in);
ex.extract(alexnet_param_id::BLOB_prob, out);
```
Read the output data in the Mat. Data inside a Mat is usually three-dimensional (c / h / w); iterate over all of it to get the scores of every class
```cpp
ncnn::Mat out_flatterned = out.reshape(out.w * out.h * out.c);
std::vector<float> scores;
scores.resize(out_flatterned.w);
for (int j=0; j<out_flatterned.w; j++)
{
scores[j] = out_flatterned[j];
}
```
### Some tricks
Extractor has a multithreading switch; setting the thread count speeds up computation
```cpp
ex.set_num_threads(4);
```
When converting an image to a Mat you can also convert the colorspace and resize it on the fly; these bundled operations are optimized as well
Common conversions such as RGB2GRAY, GRAY2RGB and RGB2BGR are supported, as are scaling down and up
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
int target_width = 227;// target resized width
int target_height = 227;// target resized height
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);
```
Net has an interface for loading from a FILE* descriptor; you can use this to merge multiple network and model files into one, which makes distribution a bit easier (it makes no difference for the memory-reference way)
> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin
```cpp
#include "net.h"
FILE* fp = fopen("alexnet-all.bin", "rb");
net.load_param_bin(fp);
net.load_model(fp);
fclose(fp);
```

View File

@ -0,0 +1,135 @@
### opencv to ncnn
* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + swap RGB/BGR
```cpp
// cv::Mat a(h, w, CV_8UC3);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB, a.cols, a.rows);
```
* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + keep RGB/BGR order
```cpp
// cv::Mat a(h, w, CV_8UC3);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_RGB, a.cols, a.rows);
```
* cv::Mat CV_8UC3 -> ncnn::Mat 1 channel + do RGB2GRAY/BGR2GRAY
```cpp
// cv::Mat rgb(h, w, CV_8UC3);
ncnn::Mat inrgb = ncnn::Mat::from_pixels(rgb.data, ncnn::Mat::PIXEL_RGB2GRAY, rgb.cols, rgb.rows);
// cv::Mat bgr(h, w, CV_8UC3);
ncnn::Mat inbgr = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2GRAY, bgr.cols, bgr.rows);
```
* cv::Mat CV_8UC1 -> ncnn::Mat 1 channel
```cpp
// cv::Mat a(h, w, CV_8UC1);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_GRAY, a.cols, a.rows);
```
* cv::Mat CV_32FC1 -> ncnn::Mat 1 channel
* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**
```cpp
// cv::Mat a(h, w, CV_32FC1);
ncnn::Mat in(a.cols, a.rows, 1, (void*)a.data);
in = in.clone();
```
* cv::Mat CV_32FC3 -> ncnn::Mat 3 channel
* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**
```cpp
// cv::Mat a(h, w, CV_32FC3);
ncnn::Mat in_pack3(a.cols, a.rows, 1, (void*)a.data, (size_t)4u * 3, 3);
ncnn::Mat in;
ncnn::convert_packing(in_pack3, in, 1);
```
* std::vector < cv::Mat > + CV_32FC1 -> ncnn::Mat multiple channels
* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**
```cpp
// std::vector<cv::Mat> a(channels, cv::Mat(h, w, CV_32FC1));
int channels = a.size();
ncnn::Mat in(a[0].cols, a[0].rows, channels);
for (int p=0; p<in.c; p++)
{
memcpy(in.channel(p), (const uchar*)a[p].data, in.w * in.h * sizeof(float));
}
```
### ncnn to opencv
* ncnn::Mat 3 channel -> cv::Mat CV_8UC3 + swap RGB/BGR
* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**
```cpp
// ncnn::Mat in(w, h, 3);
cv::Mat a(in.h, in.w, CV_8UC3);
in.to_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB);
```
* ncnn::Mat 3 channel -> cv::Mat CV_8UC3 + keep RGB/BGR order
* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**
```cpp
// ncnn::Mat in(w, h, 3);
cv::Mat a(in.h, in.w, CV_8UC3);
in.to_pixels(a.data, ncnn::Mat::PIXEL_RGB);
```
* ncnn::Mat 1 channel -> cv::Mat CV_8UC1
* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**
```cpp
// ncnn::Mat in(w, h, 1);
cv::Mat a(in.h, in.w, CV_8UC1);
in.to_pixels(a.data, ncnn::Mat::PIXEL_GRAY);
```
* ncnn::Mat 1 channel -> cv::Mat CV_32FC1
* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**
```cpp
// ncnn::Mat in;
cv::Mat a(in.h, in.w, CV_32FC1);
memcpy((uchar*)a.data, in.data, in.w * in.h * sizeof(float));
```
* ncnn::Mat 3 channel -> cv::Mat CV_32FC3
* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**
```cpp
// ncnn::Mat in(w, h, 3);
ncnn::Mat in_pack3;
ncnn::convert_packing(in, in_pack3, 3);
cv::Mat a(in.h, in.w, CV_32FC3);
memcpy((uchar*)a.data, in_pack3.data, in.w * in.h * 3 * sizeof(float));
```
* ncnn::Mat multiple channels -> std::vector < cv::Mat > + CV_32FC1
* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**
```cpp
// ncnn::Mat in(w, h, channels);
std::vector<cv::Mat> a(in.c);
for (int p=0; p<in.c; p++)
{
a[p] = cv::Mat(in.h, in.w, CV_32FC1);
memcpy((uchar*)a[p].data, in.channel(p), in.w * in.h * sizeof(float));
}
```

View File

@ -0,0 +1,48 @@
### use ncnn with own project
After building ncnn, one or more library files are generated. To integrate ncnn into your own project, you may use the cmake config file provided by ncnn's install step, or manually specify the library path(s).
**with cmake**
Ensure your project is built by cmake. Then in your project's CMakeLists.txt, add these lines:
```cmake
set(ncnn_DIR "<ncnn_install_dir>/lib/cmake/ncnn" CACHE PATH "Directory that contains ncnnConfig.cmake")
find_package(ncnn REQUIRED)
target_link_libraries(my_target ncnn)
```
After this, both the header file search path ("include directories") and the library paths are configured automatically, including the vulkan related dependencies.
Note: you have to change `<ncnn_install_dir>` to the directory on your machine that contains `ncnnConfig.cmake`.
For the prebuilt ncnn release packages, ncnnConfig is located in:
- for `ncnn-YYYYMMDD-windows-vs2019`, it is `lib/cmake/ncnn`
- for `ncnn-YYYYMMDD-android-vulkan`, it is `${ANDROID_ABI}/lib/cmake/ncnn` (`${ANDROID_ABI}` is defined in NDK's cmake toolchain file)
- other prebuilt release packages follow a similar layout
**manually specify**
You may also manually specify the ncnn library path and include directory. Note that if you use ncnn with vulkan, the vulkan related dependencies must be specified as well.
For example, in Visual Studio debug mode with vulkan required, the lib paths are:
```
E:\github\ncnn\build\vs2019-x64\install\lib\ncnnd.lib
E:\lib\VulkanSDK\1.2.148.0\Lib\vulkan-1.lib
E:\github\ncnn\build\vs2019-x64\install\lib\SPIRVd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\glslangd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\MachineIndependentd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OGLCompilerd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OSDependentd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\GenericCodeGend.lib
```
And for release mode, the lib paths are:
```
E:\github\ncnn\build\vs2019-x64\install\lib\ncnn.lib
E:\lib\VulkanSDK\1.2.148.0\Lib\vulkan-1.lib
E:\github\ncnn\build\vs2019-x64\install\lib\SPIRV.lib
E:\github\ncnn\build\vs2019-x64\install\lib\glslang.lib
E:\github\ncnn\build\vs2019-x64\install\lib\MachineIndependent.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OGLCompiler.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OSDependent.lib
E:\github\ncnn\build\vs2019-x64\install\lib\GenericCodeGen.lib
```

View File

@ -0,0 +1,55 @@
Here is a practical guide for converting a pytorch model to ncnn.
resnet18 is used as the example.
## pytorch to onnx
The official pytorch tutorial for exporting onnx model
https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html
```python
import torch
import torchvision
import torch.onnx
# An instance of your model
model = torchvision.models.resnet18()
# An example input you would normally provide to your model's forward() method
x = torch.rand(1, 3, 224, 224)
# Export the model
torch_out = torch.onnx._export(model, x, "resnet18.onnx", export_params=True)
```
## simplify onnx model
The exported resnet18.onnx model may contain many redundant operators such as Shape, Gather and Unsqueeze that are not supported in ncnn
```
Shape not supported yet!
Gather not supported yet!
# axis=0
Unsqueeze not supported yet!
# axes 7
Unsqueeze not supported yet!
# axes 7
```
Fortunately, daquexian developed a handy tool to eliminate them. cheers!
https://github.com/daquexian/onnx-simplifier
```
python3 -m onnxsim resnet18.onnx resnet18-sim.onnx
```
## onnx to ncnn
Finally, you can convert the model to ncnn using tools/onnx2ncnn
```
onnx2ncnn resnet18-sim.onnx resnet18.param resnet18.bin
```
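After conversion, the model can be loaded and run like any other ncnn model. A minimal sketch (the input/output blob names below are placeholders; check the generated resnet18.param for the real ones):
```cpp
#include "net.h"

int main()
{
    ncnn::Net net;
    net.load_param("resnet18.param");
    net.load_model("resnet18.bin");

    ncnn::Mat in(224, 224, 3);
    in.fill(0.5f); // stand-in for real preprocessed image data

    ncnn::Extractor ex = net.create_extractor();
    ex.input("input", in);     // placeholder blob name
    ncnn::Mat out;
    ex.extract("output", out); // placeholder blob name
    return 0;
}
```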

View File

@ -0,0 +1,25 @@
The typical usage
```
ncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 65536
```
operator fusion
* batchnorm - scale
* convolution - batchnorm
* convolutiondepthwise - batchnorm
* deconvolution - batchnorm
* deconvolutiondepthwise - batchnorm
* innerproduct - batchnorm
* convolution - relu
* convolutiondepthwise - relu
* deconvolution - relu
* deconvolutiondepthwise - relu
* innerproduct - relu
eliminate noop operator
* innerproduct - dropout
* flatten after global pooling
prefer better operator
* replace convolution with innerproduct after global pooling

View File

@ -0,0 +1,173 @@
## supported platform
* Y = known work
* ? = shall work, not confirmed
* / = not applied
| |windows|linux|android|mac|ios|
|---|---|---|---|---|---|
|intel|Y|Y|?|?|/|
|amd|Y|Y|/|?|/|
|nvidia|Y|Y|?|/|/|
|qcom|/|/|Y|/|/|
|apple|/|/|/|?|Y|
|arm|/|?|?|/|/|
## enable vulkan compute support
```
$ sudo dnf install vulkan-devel
$ cmake -DNCNN_VULKAN=ON ..
```
## enable vulkan compute inference
```cpp
ncnn::Net net;
net.opt.use_vulkan_compute = 1;
```
## proper allocator usage
```cpp
ncnn::VkAllocator* blob_vkallocator = vkdev.acquire_blob_allocator();
ncnn::VkAllocator* staging_vkallocator = vkdev.acquire_staging_allocator();
net.opt.blob_vkallocator = blob_vkallocator;
net.opt.workspace_vkallocator = blob_vkallocator;
net.opt.staging_vkallocator = staging_vkallocator;
// ....
// after inference
vkdev.reclaim_blob_allocator(blob_vkallocator);
vkdev.reclaim_staging_allocator(staging_vkallocator);
```
## select gpu device
```cpp
// get gpu count
int gpu_count = ncnn::get_gpu_count();
// set specified vulkan device before loading param and model
net.set_vulkan_device(0); // use device-0
net.set_vulkan_device(1); // use device-1
```
## zero-copy on unified memory device
```cpp
ncnn::VkMat blob_gpu;
ncnn::Mat mapped = blob_gpu.mapped();
// use mapped.data directly
```
## hybrid cpu/gpu inference
```cpp
ncnn::Extractor ex_cpu = net.create_extractor();
ncnn::Extractor ex_gpu = net.create_extractor();
ex_cpu.set_vulkan_compute(false);
ex_gpu.set_vulkan_compute(true);
#pragma omp parallel sections
{
#pragma omp section
{
ex_cpu.input();
ex_cpu.extract();
}
#pragma omp section
{
ex_gpu.input();
ex_gpu.extract();
}
}
```
## zero-copy gpu inference chaining
```cpp
ncnn::Extractor ex1 = net1.create_extractor();
ncnn::Extractor ex2 = net2.create_extractor();
ncnn::VkCompute cmd(&vkdev);
ncnn::VkMat conv1;
ncnn::VkMat conv2;
ncnn::VkMat conv3;
ex1.input("conv1", conv1);
ex1.extract("conv2", conv2, cmd);
ex2.input("conv2", conv2);
ex2.extract("conv3", conv3, cmd);
cmd.submit();
cmd.wait();
```
## batch inference
```cpp
int max_batch_size = vkdev->info.compute_queue_count;
ncnn::Mat inputs[1000];
ncnn::Mat outputs[1000];
#pragma omp parallel for num_threads(max_batch_size)
for (int i=0; i<1000; i++)
{
ncnn::Extractor ex = net1.create_extractor();
ex.input("data", inputs[i]);
ex.extract("prob", outputs[i]);
}
```
## control storage and arithmetic precision
disable all lower-precision optimizations, get full fp32 precision
```cpp
ncnn::Net net;
net.opt.use_fp16_packed = false;
net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;
net.opt.use_int8_storage = false;
net.opt.use_int8_arithmetic = false;
```
## debugging tips
```cpp
#define ENABLE_VALIDATION_LAYER 1 // modify to 1 in gpu.cpp
```
## add vulkan compute support to layer
1. add vulkan shader in src/layer/shader/
2. upload model weight data in Layer::upload_model()
3. setup pipeline in Layer::create_pipeline()
4. destroy pipeline in Layer::destroy_pipeline()
5. record command in Layer::forward()
## add optimized shader path
1. add vulkan shader in src/layer/shader/ named XXX_abc.comp
2. create pipeline with "XXX_abc"
3. record command using XXX_abc pipeline
## low-level op api
1. create layer
2. load param and load model
3. upload model
4. create pipeline
5. new command
6. record
7. submit and wait