feat: switch backend to PaddleOCR-NCNN, switch project to CMake

1. The project backend has been fully migrated to the PaddleOCR-NCNN algorithm and has passed basic compatibility testing.
2. The project is now organized with CMake; to better accommodate third-party libraries, a QMake project will no longer be provided.
3. The copyright/notice files and the code tree have been reorganized to minimize infringement risk.

Log: switch backend to PaddleOCR-NCNN, switch project to CMake
Change-Id: I4d5d2c5d37505a4a24b389b1a4c5d12f17bfa38c
172 3rdparty/ncnn/docs/how-to-use-and-FAQ/FAQ-ncnn-produce-wrong-result.md vendored Normal file
@@ -0,0 +1,172 @@
### caffemodel should be row-major

The `caffe2ncnn` tool assumes the caffemodel is row-major (produced by the C++ `caffe train` command).

The kernel 3x3 weights should be stored as

```
a b c
d e f
g h i
```

However, matlab caffe produces col-major caffemodels.

You have to transpose all the kernel weights yourself, or re-train using the C++ `caffe train` command.

Besides, you may be interested in https://github.com/conanhujinming/matcaffe2caffe

### check input is RGB or BGR

If your caffemodel is trained using C++ caffe and opencv, the input image should be in BGR order.

If your model is trained using matlab caffe, pytorch, mxnet or tensorflow, the input image is probably in RGB order.

The channel order can be changed on-the-fly through the proper pixel type enum

```
// construct RGB blob from rgb image
ncnn::Mat in_rgb = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB, w, h);

// construct BGR blob from bgr image
ncnn::Mat in_bgr = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR, w, h);

// construct BGR blob from rgb image
ncnn::Mat in_bgr = ncnn::Mat::from_pixels(rgb_data, ncnn::Mat::PIXEL_RGB2BGR, w, h);

// construct RGB blob from bgr image
ncnn::Mat in_rgb = ncnn::Mat::from_pixels(bgr_data, ncnn::Mat::PIXEL_BGR2RGB, w, h);
```

### image decoding

JPEG (`.jpg`, `.jpeg`) is a lossy compression format, so people may get different pixel values for the same position in the same image.

`.bmp` images are recommended instead.

### interpolation / resizing

There are several image resizing methods, which may generate different results for the same input image.

Even if we specify the same interpolation method, different frameworks/libraries and their various versions may also introduce differences.

A good practice is to feed an image of exactly the size the input layer expects, e.g. read a 224x224 bmp image when the input layer needs 224x224.
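
If the source image size differs from what the network expects, ncnn can also do the resize while constructing the blob. A minimal sketch, assuming a 224x224 input layer and an RGB pixel buffer `rgb_data` of size `w` x `h`:

```cpp
// resize to the exact size the input layer expects while constructing the blob;
// this uses ncnn's own bilinear resize, which may still differ slightly from the
// resize used at training time
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgb_data, ncnn::Mat::PIXEL_RGB, w, h, 224, 224);
```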

### Mat::from_pixels/from_pixels_resize assume that the pixel data is continuous

You shall pass a continuous pixel buffer to the from_pixels family.

If your image is an opencv submat from an image roi, call clone() to get a continuous one.

```
cv::Mat image;// the image
cv::Rect facerect;// the face rectangle

cv::Mat faceimage = image(facerect).clone();// get a continuous sub image

ncnn::Mat in = ncnn::Mat::from_pixels(faceimage.data, ncnn::Mat::PIXEL_BGR, faceimage.cols, faceimage.rows);
```

### pre process

Apply pre-processing according to your training configuration.

Different models have different pre-processing configs; you may find a transform config like the following in the Data layer section

```
transform_param {
    mean_value: 103.94
    mean_value: 116.78
    mean_value: 123.68
    scale: 0.017
}
```

Then the corresponding code for ncnn pre-processing is

```cpp
const float mean_vals[3] = { 103.94f, 116.78f, 123.68f };
const float norm_vals[3] = { 0.017f, 0.017f, 0.017f };
in.substract_mean_normalize(mean_vals, norm_vals);
```

Mean file is not supported currently,

so you have to pre-process the input data by yourself (using opencv or similar)

```
transform_param {
    mean_file: "imagenet_mean.binaryproto"
}
```

For pytorch or mxnet-gluon

```python
transforms.ToTensor(),
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
```

Then the corresponding code for ncnn pre-processing is

```cpp
// R' = (R / 255 - 0.485) / 0.229 = (R - 0.485 * 255) / 0.229 / 255
// G' = (G / 255 - 0.456) / 0.224 = (G - 0.456 * 255) / 0.224 / 255
// B' = (B / 255 - 0.406) / 0.225 = (B - 0.406 * 255) / 0.225 / 255
const float mean_vals[3] = {0.485f*255.f, 0.456f*255.f, 0.406f*255.f};
const float norm_vals[3] = {1/0.229f/255.f, 1/0.224f/255.f, 1/0.225f/255.f};
in.substract_mean_normalize(mean_vals, norm_vals);
```

### use the desired blob

The blob names for input and extract differ among models.

For example, squeezenet v1.1 uses "data" as the input blob and "prob" as the output blob, while mobilenet-ssd uses "data" as the input blob and "detection_out" as the output blob.

Some models may need multiple inputs or produce multiple outputs.

```cpp
ncnn::Extractor ex = net.create_extractor();

ex.input("data", in);// change "data" to yours
ex.input("mask", mask);// change "mask" to yours

ex.extract("output1", out1);// change "output1" to yours
ex.extract("output2", out2);// change "output2" to yours
```

### blob may have channel gap

Each channel pointer is aligned to 128 bits in the ncnn Mat structure.

A blob may have gaps between channels if (width x height) cannot be divided exactly by 4.

Prefer using ncnn::Mat::from_pixels or ncnn::Mat::from_pixels_resize for constructing the input blob from image data.

If you do need a continuous blob buffer, reshape the output.

```cpp
// out is the output blob extracted
ncnn::Mat flattened_out = out.reshape(out.w * out.h * out.c);

// plain array, C-H-W
const float* outptr = flattened_out;
```

### create new Extractor for each image

The `ncnn::Extractor` object is stateful; if you reuse it for different inputs, you will always get exactly the same result cached inside.

Always create a new Extractor to process images in a loop, unless you do know how the stateful Extractor works.

```cpp
for (int i=0; i<count; i++)
{
    // always create Extractor
    // it's cheap and almost instant!
    ncnn::Extractor ex = net.create_extractor();

    // use
    ex.input("data", your_data[i]);// change "data" to yours
}
```

### use proper loading api

If you want to load a plain param file buffer, you shall use Net::load_param_mem instead of Net::load_param.

For more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)

```cpp
ncnn::Net net;

// param_buffer is the content buffer of the XYZ.param file
net.load_param_mem(param_buffer);
```
73 3rdparty/ncnn/docs/how-to-use-and-FAQ/FAQ-ncnn-protobuf-problem.zh.md vendored Normal file
@@ -0,0 +1,73 @@
### `Protobuf not found` when building `caffe2ncnn` on Linux

This usually means protobuf is not installed or the environment variables are not set.

1. Install protobuf

On Ubuntu, try
> sudo apt-get install libprotobuf-dev protobuf-compiler

On CentOS, try
> sudo yum install protobuf-devel.x86_64 protobuf-compiler.x86_64

2. Then set up the C++ environment

Open `~/.bashrc` and append
> export LD_LIBRARY_PATH=${YOUR_PROTOBUF_LIB_PATH}:$LD_LIBRARY_PATH

3. Apply the configuration
> source ~/.bashrc

### protoc and protobuf.so version mismatch when building `caffe2ncnn`

This usually means more than one protobuf is installed on the system.

#### Change the link path directly

1. First check which .so version protoc requires
> ldd \`whereis protoc| awk '{print $2}'\` | grep libprotobuf.so

For example, libprotobuf.so.10

2. Then find the directory containing this file
> cd / && find . -type f | grep libprotobuf.so.10

Suppose it is in `/home/user/mydir`

3. Set the search directory for protobuf.so

Open `~/.bashrc` and append
> export LD_LIBRARY_PATH=/home/user/mydir:$LD_LIBRARY_PATH

4. Apply the configuration
> source ~/.bashrc

#### If the above does not work, try building protobuf from source

1. First download the required protobuf version from [protobuf/releases](https://github.com/protocolbuffers/protobuf/releases/tag/v3.10.0), e.g. v3.10.0. Be sure to download the archive with the -cpp suffix.

2. Extract it to a directory, then build
> tar xvf protobuf-cpp-3.10.0.tar.gz && cd protobuf-3.10.0/
./configure --prefix=/your_install_dir && make -j 3 && make install

3. Do **not** omit `--prefix` and install directly into the system directories; the compiled .so and header files stay in `your_install_dir`

4. Set the search directories for protobuf.so

Open `~/.bashrc` and append

```bash
export LD_LIBRARY_PATH=/your_install_dir/lib:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=/your_install_dir/include:$CPLUS_INCLUDE_PATH
```

5. Apply the configuration
> source ~/.bashrc

#### If none of the above works

Try removing the existing protobuf (be careful not to remove the one shipped with the system; beginners should be cautious), then reinstall the required .so with
> sudo apt-get install --reinstall libprotobuf8

Change the version number to match your own.

### On Windows, the basic approach to such problems is likewise to change the environment variables in the IDE

### Essential tools of the trade

For environment variable setup, tools and tricks, it is strongly recommended to study https://missing.csail.mit.edu/
129 3rdparty/ncnn/docs/how-to-use-and-FAQ/FAQ-ncnn-throw-error.md vendored Normal file
@@ -0,0 +1,129 @@
### param is too old, please regenerate

Your model file is in the old format converted by an old caffe2ncnn tool.

Checkout the latest ncnn code, build it and regenerate the param and model binary files, and that should work.

Make sure that your param file starts with the magic number 7767517.

You may find more info on [use-ncnn-with-alexnet](use-ncnn-with-alexnet)

### find_blob_index_by_name XYZ failed

That means ncnn couldn't find the XYZ blob in the network.

You shall call Extractor::input()/extract() by blob name instead of layer name.

For models loaded from a binary param file or external memory, you shall call Extractor::input()/extract() with the enum defined in xxx.id.h, because all the visible string literals have been stripped in the binary form.

This error usually happens when the input layer is not properly converted.

You shall upgrade the caffe prototxt/caffemodel before converting it to ncnn. An input layer like the following shall be ok.

```
layer {
    name: "data"
    type: "Input"
    top: "data"
    input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```

You may find more info on [use-ncnn-with-alexnet](use-ncnn-with-alexnet).

### layer XYZ not exists or registered

Your network contains some operations that are not implemented in ncnn.

You may implement them as custom layers following [how-to-implement-custom-layer-step-by-step](how-to-implement-custom-layer-step-by-step).

Or you could simply register them as no-op if you are sure those operations make no sense.

```cpp
class Noop : public ncnn::Layer {};
DEFINE_LAYER_CREATOR(Noop)

net.register_custom_layer("LinearRegressionOutput", Noop_layer_creator);
net.register_custom_layer("MAERegressionOutput", Noop_layer_creator);
```

### fopen XYZ.param/XYZ.bin failed

File not found or not readable. Make sure that XYZ.param/XYZ.bin is accessible.

### network graph not ready

You shall call Net::load_param() first, then Net::load_model().

This error may also happen when Net::load_param() failed but was not properly handled.

For more information about the ncnn model load api, see [ncnn-load-model](ncnn-load-model)

### memory not 32-bit aligned at XYZ

The pointer passed to Net::load_param() or Net::load_model() is not 32-bit aligned.

In practice, the head pointer of std::vector<unsigned char> is not guaranteed to be 32-bit aligned.

You can store your binary buffer in an ncnn::Mat structure, whose internal memory is aligned.
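
A minimal sketch of this workaround; `raw_data` and `raw_size` (the bytes of a binary param.bin file) are illustrative names:

```cpp
// allocate an aligned byte buffer through ncnn::Mat (raw_size elements of 1 byte each)
ncnn::Mat aligned_buffer(raw_size, (size_t)1u);
memcpy(aligned_buffer.data, raw_data, raw_size);

// the internally aligned pointer can now be passed to the binary loading api
ncnn::Net net;
net.load_param((const unsigned char*)aligned_buffer.data);
```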

### undefined reference to '__kmpc_XYZ_XYZ'

Use clang for building the android shared library.

Comment out the following line in your Application.mk
```
NDK_TOOLCHAIN_VERSION := 4.9
```

### crash on android with '__kmp_abort_process'

This usually happens if you bundle multiple shared libraries with openmp linked.

It is actually an issue of the android ndk https://github.com/android/ndk/issues/1028

On old android ndk, modify the link flags as

```
-Wl,-Bstatic -lomp -Wl,-Bdynamic
```

For recent ndk >= 21

```
-fstatic-openmp
```

### dlopen failed: library "libomp.so" not found

Newer android ndk defaults to the dynamic openmp runtime.

Modify the link flags as

```
-fstatic-openmp -fopenmp
```

### crash when freeing a ncnn dynamic library (*.dll/*.so) built with openMP

For optimal performance, the openmp threadpool spin-waits for about a second before shutting down, in case more work becomes available.

If you unload a dynamic library that's in the process of spin-waiting, it will crash in the manner you see (most of the time).

Just set OMP_WAIT_POLICY=passive in your environment before calling LoadLibrary, or just wait a few seconds before calling FreeLibrary.

You can also use the following methods to set the environment variable in your code:

for msvc++:

```
SetEnvironmentVariable(_T("OMP_WAIT_POLICY"), _T("passive"));
```

for g++:

```
setenv("OMP_WAIT_POLICY", "passive", 1)
```

reference: https://stackoverflow.com/questions/34439956/vc-crash-when-freeing-a-dll-built-with-openmp
124 3rdparty/ncnn/docs/how-to-use-and-FAQ/FAQ-ncnn-vulkan.md vendored Normal file
@@ -0,0 +1,124 @@
### how to enable ncnn vulkan capability

Follow [the build and install instruction](https://github.com/Tencent/ncnn/blob/master/docs/how-to-build/how-to-build.md)

Make sure you have installed the vulkan sdk from the [lunarg vulkan sdk website](https://vulkan.lunarg.com/sdk/home)

Usually, you can enable the vulkan compute inference feature by adding only one line of code to your application.

```cpp
// enable vulkan compute feature before loading
ncnn::Net net;
net.opt.use_vulkan_compute = 1;
```

### does my graphics device support vulkan

Some platforms have been tested and are known to work. In theory, if your platform supports the vulkan api, either 1.0 or 1.1, it shall work.

* Y = known to work
* ? = shall work, not confirmed
* / = not applicable

| |windows|linux|android|mac|ios|
|---|---|---|---|---|---|
|intel|Y|Y|?|?|/|
|amd|Y|Y|/|?|/|
|nvidia|Y|Y|?|/|/|
|qcom|/|/|Y|/|/|
|apple|/|/|/|Y|Y|
|arm|/|?|Y|/|/|

You can search [the vulkan database](https://vulkan.gpuinfo.org) to see if your device supports vulkan.

Some old buggy drivers may produce wrong results; they are blacklisted in ncnn and treated as non-vulkan capable devices.
You could check whether your device and driver have this issue with [my conformance test here](vulkan-conformance-test).
Most of these systems are android with a version lower than 8.1.
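
You can also query at runtime whether ncnn sees a usable vulkan device; a minimal sketch:

```cpp
#include "gpu.h"

// number of vulkan-capable devices ncnn can use; 0 means vulkan inference is unavailable
int gpu_count = ncnn::get_gpu_count();
```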

### why using vulkan over cuda/opencl/metal

In the beginning, I had no GPGPU programming experience, and I had to learn one.

vulkan is considered more portable, is well supported by vendors, and is the cross-platform low-overhead graphics api. In contrast, cuda is only available on nvidia devices, metal is only available on macos and ios, while loading the opencl library is banned in android 7.0+ and does not work on ios.

### I got errors like "vkCreateComputePipelines failed -1000012000" or random stalls or crashes

Upgrade your vulkan driver.

[intel https://downloadcenter.intel.com/product/80939/Graphics-Drivers](https://downloadcenter.intel.com/product/80939/Graphics-Drivers)

[amd https://www.amd.com/en/support](https://www.amd.com/en/support)

[nvidia https://www.nvidia.com/Download/index.aspx](https://www.nvidia.com/Download/index.aspx)

### how to use ncnn vulkan on android

minimum android ndk version: android-ndk-r18b

minimum sdk platform api version: android-24

Link your jni project with libvulkan.so

[The squeezencnn example](https://github.com/Tencent/ncnn/tree/master/examples/squeezencnn) is equipped with gpu inference; you could take it as a reference.

### how to use ncnn vulkan on ios

Set up the vulkan sdk (https://vulkan.lunarg.com/sdk/home#mac)

Metal only works on real devices with an arm64 cpu (iPhone 5s and later)

Link your project with the MoltenVK framework and Metal

### what about the layers without vulkan support

These layers have vulkan support currently

AbsVal, BatchNorm, BinaryOp, Cast, Clip, Concat, Convolution, ConvolutionDepthWise, Crop, Deconvolution, DeconvolutionDepthWise, Dropout, Eltwise, Flatten, HardSigmoid, InnerProduct, Interp, LRN, Packing, Padding, Permute, Pooling(pad SAME not supported), PReLU, PriorBox, ReLU, Reorg, Reshape, Scale, ShuffleChannel, Sigmoid, Softmax, TanH, UnaryOp

For the layers without vulkan support, the ncnn inference engine will automatically fall back to the cpu path.

Thus, it is usually not a serious issue if your network only has some special head layers like SSD or YOLO. All examples in ncnn are known to work properly with vulkan enabled.

### my model runs slower on gpu than cpu

The current vulkan inference implementation is far from the preferred state. Many handy optimization techniques are planned, such as winograd convolution, operator fusion, fp16 storage and arithmetic, etc.

It is common for your model to run slower on gpu than on cpu on arm devices like mobile phones, since we have quite good arm optimization in ncnn ;)

### vulkan device not found / extra high cpu utilization while vulkan is enabled on nvidia gpu

There are several reasons that could lead to this outcome. First please check your driver status with `nvidia-smi`. If you have correctly installed your driver, you should see something like this:

```bash
$ nvidia-smi
Sat Mar 06 19:53:16 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 451.48       Driver Version: 451.48       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060    WDDM | 00000000:02:00.0 Off |                  N/A |
| N/A   31C    P8     5W /  N/A |     90MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

If `nvidia-smi` crashes or cannot be found, please reinstall your graphics driver.

If ncnn *is* utilizing the GPU, you can see your program in the `Processes` block at the bottom. In that case, it's likely that some operators are not yet supported in Vulkan and have fallen back to the CPU, leading to a low utilization of the GPU.

If you *couldn't* find your process running, please check the active driver model, which can be found to the right of your device name. For GeForce and Titan GPUs, the default driver model is WDDM (Windows Display Driver Model), which supports both rendering graphics and computing. But for Tesla GPUs, without configuration, the driver model defaults to TCC ([Tesla Compute Cluster](https://docs.nvidia.com/gameworks/content/developertools/desktop/tesla_compute_cluster.htm)). NVIDIA's TCC driver does not support Vulkan, so you need to use the following command to set the driver model back to WDDM, to use Vulkan:

```bash
$ nvidia-smi -g 0 -dm 0
```

The number following `-g` is the GPU ID (which can be found to the left of your device name in the `nvidia-smi` output); `-dm` stands for driver model, where 0 refers to WDDM and 1 means TCC.
136 3rdparty/ncnn/docs/how-to-use-and-FAQ/build-minimal-library.md vendored Normal file
@@ -0,0 +1,136 @@
For some reason, if you're not happy with the binary size of the ncnn library, here is the cheatsheet that helps you build a minimal ncnn :P

### disable c++ rtti and exceptions

```
cmake -DNCNN_DISABLE_RTTI=ON -DNCNN_DISABLE_EXCEPTION=ON ..
```
* Cannot use RTTI and exceptions when ncnn functions are called.

### disable vulkan support

```
cmake -DNCNN_VULKAN=OFF ..
```

* Cannot use GPU acceleration.

### disable NCNN_STDIO

```
cmake -DNCNN_STDIO=OFF ..
```

* Cannot load models from files, but can load models from memory or from Android Assets.

Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#load-model).

### disable NCNN_STRING

```
cmake -DNCNN_STRING=OFF ..
```

* Cannot load human-readable param files with visible strings, but can load binary param.bin files.

Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#strip-visible-string)

* Cannot identify blobs by string name when calling `Extractor::input / extract`, but can identify them by the enum values in `id.h`.

Read more [here](https://github.com/Tencent/ncnn/blob/master/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md#input-and-output).

### disable NCNN_BF16

```
cmake -DNCNN_BF16=OFF ..
```

* Cannot use the bf16 storage type in inference.

### disable NCNN_INT8

```
cmake -DNCNN_INT8=OFF ..
```

* Cannot use quantized int8 inference.

### drop pixel drawing functions

```
cmake -DNCNN_PIXEL_DRAWING=OFF ..
```

* Cannot use functions for drawing basic shapes and text like `ncnn::draw_rectangle_xx / ncnn::draw_circle_xx / ncnn::draw_text_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available.

### drop pixel rotate and affine functions

```
cmake -DNCNN_PIXEL_ROTATE=OFF -DNCNN_PIXEL_AFFINE=OFF ..
```

* Cannot use functions for rotation and affine transformation like `ncnn::kanna_rotate_xx / ncnn::warpaffine_bilinear_xx`, but functions like `Mat::from_pixels / from_pixels_resize` are still available.

### drop pixel functions

```
cmake -DNCNN_PIXEL=OFF ..
```

* Cannot use functions converting between images and pixels like `Mat::from_pixels / from_pixels_resize / to_pixels / to_pixels_resize`; you need to create a Mat and fill in the data by hand.

### disable openmp

```
cmake -DNCNN_OPENMP=OFF ..
```

* Cannot use openmp multi-threading acceleration. If you want to run a model in a single thread on your target machine, it is recommended to turn this option off.

### disable avx2 and arm82 optimized kernels

```
cmake -DNCNN_AVX2=OFF -DNCNN_ARM82=OFF ..
```

* Does not compile the optimized kernels that use the avx2 / arm82 instruction set extensions. If your target machine does not support some of them, it is recommended to turn the related options off.

### disable runtime cpu instruction dispatch

```
cmake -DNCNN_RUNTIME_CPU=OFF ..
```

* Cannot check the supported cpu instruction set extensions at runtime and use the related optimized kernels.
* If you know which instruction set extensions are supported on your target machine, like avx2 / arm82, you can turn on the related options like `-DNCNN_AVX2=ON / -DNCNN_ARM82=ON` by hand, and then the sse2 / arm8 version kernels will not be compiled.

### drop layers not used

```
cmake -DWITH_LAYER_absval=OFF -DWITH_LAYER_bnll=OFF ..
```

* If your model does not include some layers, taking absval / bnll as examples above, you can drop them.
* Some key or dependency layers should not be dropped, like convolution / innerproduct, their dependencies like padding / flatten, and activations like relu / clip.

### disable c++ stl

```
cmake -DNCNN_SIMPLESTL=ON ..
```

* The STL provided by the compiler is no longer a dependency; the `simplestl` provided by ncnn is used as a replacement. Users can then also only use `simplestl` when ncnn functions are called.
* Usually requires the compiler parameters `-nodefaultlibs -fno-builtin -nostdinc++ -lc`
* Requires the cmake parameters `cmake -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake -DANDROID_STL=system` to avoid STL conflicts when compiling for Android.

### drop optimized kernels not used

* Modify the source code under `ncnn/src/layer/arm/` to delete unnecessary optimized kernels or replace them with empty functions.
* You can also drop layers and their related optimized kernels with `-DWITH_LAYER_absval=OFF` as mentioned above.

### drop operators from BinaryOp UnaryOp

* Modify `ncnn/src/layer/binaryop.cpp unaryop.cpp` and `ncnn/src/layer/arm/binaryop.cpp unaryop_arm.cpp` by hand to delete unnecessary operators.
162 3rdparty/ncnn/docs/how-to-use-and-FAQ/efficient-roi-resize-rotate.md vendored Normal file
@@ -0,0 +1,162 @@
### image roi crop + convert to ncnn::Mat

```
+--------------+
|  y           |         /-------/
|  x +-------+ |        +-------+|
|    |  roih |im_h  =>  |  roih ||
|    +-roiw--+ |        +-roiw--+/
|              |
+-----im_w-----+
```
```cpp
ncnn::Mat in = ncnn::Mat::from_pixels_roi(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih);
```
For an Android Application, it is:
```cpp
ncnn::Mat in = ncnn::Mat::from_android_bitmap_roi(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih);
```

### image roi crop + resize + convert to ncnn::Mat

```
+--------------+
|  y           |         /----/
|  x +-------+ |        +----+|
|    |  roih |im_h  =>  |    |target_h
|    +-roiw--+ |        |    ||
|              |        +----+/
+-----im_w-----+       target_w
```
```cpp
ncnn::Mat in = ncnn::Mat::from_pixels_roi_resize(im.data, ncnn::Mat::PIXEL_RGB, im_w, im_h, x, y, roiw, roih, target_w, target_h);
```
For an Android Application, it is:
```cpp
ncnn::Mat in = ncnn::Mat::from_android_bitmap_roi_resize(env, image, ncnn::Mat::PIXEL_RGBA2RGB, x, y, roiw, roih, target_w, target_h);
```

### ncnn::Mat export image + offset paste

```
                 +--------------+
  /-------/      |  y           |
 +-------+|      |  x +-------+ |
 |      h||  =>  |    |      h|im_h
 +---w---+/      |    +---w---+ |
                 |              |
                 +-----im_w-----+
```
```cpp
unsigned char* data = im.data + (y * im_w + x) * 3;
out.to_pixels(data, ncnn::Mat::PIXEL_RGB, im_w * 3);
```

### ncnn::Mat export image + resize + roi paste

```
              +--------------+
  /----/      |  y           |
 +----+|      |  x +-------+ |
 |    h|  =>  |    |   roih|im_h
 |    ||      |    +-roiw--+ |
 +-w--+/      |              |
              +-----im_w-----+
```
```cpp
unsigned char* data = im.data + (y * im_w + x) * 3;
out.to_pixels_resize(data, ncnn::Mat::PIXEL_RGB, roiw, roih, im_w * 3);
```

### image roi crop + resize
```
+--------------+
|  y           |
|  x +-------+ |        +----+
|    |   roih|im_h  =>  |    |target_h
|    +-roiw--+ |        |    |
|              |        +----+
+-----im_w-----+       target_w
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
ncnn::resize_bilinear_c3(data, roiw, roih, im_w * 3, outdata, target_w, target_h, target_w * 3);
```

### image resize + offset paste
```
             +--------------+
             |  y           |
 +----+      |  x +-------+ |
 |    |h =>  |    |  roih |im_h
 |    |      |    +-roiw--+ |
 +-w--+      |              |
             +-----im_w-----+
```
```cpp
unsigned char* outdata = im.data + (y * im_w + x) * 3;
ncnn::resize_bilinear_c3(data, w, h, w * 3, outdata, roiw, roih, im_w * 3);
```

### image roi crop + resize + roi paste
```
+--------------+        +-----------------+
|  y           |        | roiy            |
|  x +-------+ |        | roix +--------+ |
|    |      h|im_h  =>  |      |target_h|outim_h
|    +---w---+ |        |      +target_w+ |
|              |        |                 |
+-----im_w-----+        +-----outim_w-----+
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
unsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;
ncnn::resize_bilinear_c3(data, w, h, im_w * 3, outdata, target_w, target_h, outim_w * 3);
```

### image roi crop + rotate
```
+--------------+
|  y           |
|  x +-------+ |        +---+
|    | < < h |im_h  =>  | ^ |w
|    +---w---+ |        | ^ |
|              |        +---+
+-----im_w-----+          h
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
ncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, h * 3, 6);
```

### image rotate + offset paste
```
            +--------------+
            |  y           |
 +---+      |  x +-------+ |
 | ^ |h =>  |    | < < w |im_h
 | ^ |      |    +---h---+ |
 +---+      |              |
   w        +-----im_w-----+
```
```cpp
unsigned char* outdata = im.data + (y * im_w + x) * 3;
ncnn::kanna_rotate_c3(data, w, h, w * 3, outdata, h, w, im_w * 3, 7);
```

### image roi crop + rotate + roi paste
```
+--------------+        +-----------------+
|  y           |        | roiy            |
|  x +-------+ |        | roix +---+      |
|    | < < h |im_h  =>  |      | ^ |w     |outim_h
|    +---w---+ |        |      | ^ |      |
|              |        |      +-h-+      |
+-----im_w-----+        +-----outim_w-----+
```
```cpp
const unsigned char* data = im.data + (y * im_w + x) * 3;
unsigned char* outdata = outim.data + (roiy * outim_w + roix) * 3;
ncnn::kanna_rotate_c3(data, w, h, im_w * 3, outdata, h, w, outim_w * 3, 6);
```
26 3rdparty/ncnn/docs/how-to-use-and-FAQ/ncnn-load-model.md vendored Normal file
@@ -0,0 +1,26 @@
### the comprehensive model loading api table

|load from|alexnet.param|alexnet.param.bin|alexnet.bin|
|---|---|---|---|
|file path|load_param(const char*)|load_param_bin(const char*)|load_model(const char*)|
|file descriptor|load_param(FILE*)|load_param_bin(FILE*)|load_model(FILE*)|
|file memory|load_param_mem(const char*)|load_param(const unsigned char*)|load_model(const unsigned char*)|
|android asset|load_param(AAsset*)|load_param_bin(AAsset*)|load_model(AAsset*)|
|android asset path|load_param(AAssetManager*, const char*)|load_param_bin(AAssetManager*, const char*)|load_model(AAssetManager*, const char*)|
|custom IO reader|load_param(const DataReader&)|load_param_bin(const DataReader&)|load_model(const DataReader&)|

### points to note

1. Either of the following combinations shall be enough for loading a model
    * alexnet.param + alexnet.bin
    * alexnet.param.bin + alexnet.bin

2. Never modify the Net opt member after loading

3. Most loading functions return 0 on success, except for loading alexnet.param.bin and alexnet.bin from file memory, which return the number of bytes consumed after loading
    * int Net::load_param(const unsigned char*)
    * int Net::load_model(const unsigned char*)

4. On Android, it is recommended to load models from Android assets directly to avoid copying them to the sdcard

5. The custom IO reader interface can be used to implement on-the-fly model decryption and loading, as sketched below
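
A minimal sketch of such a custom reader; the XOR scheme, the key, and the file layout are illustrative assumptions only:

```cpp
#include "net.h"
#include "datareader.h"

class XorDecryptReader : public ncnn::DataReader
{
public:
    XorDecryptReader(FILE* fp, unsigned char key) : dr(fp), key(key) {}

    // binary param.bin / model loading only needs read()
    virtual size_t read(void* buf, size_t size) const
    {
        size_t n = dr.read(buf, size);
        unsigned char* p = (unsigned char*)buf;
        for (size_t i = 0; i < n; i++)
            p[i] ^= key; // undo the (assumed) xor obfuscation
        return n;
    }

private:
    ncnn::DataReaderFromStdio dr;
    unsigned char key;
};

// usage: the concatenated encrypted param.bin + bin file is also an assumption
FILE* fp = fopen("alexnet-all.encrypted.bin", "rb");
XorDecryptReader dr(fp, 0x5A);
ncnn::Net net;
net.load_param_bin(dr);
net.load_model(dr);
fclose(fp);
```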
74 3rdparty/ncnn/docs/how-to-use-and-FAQ/openmp-best-practice.md vendored Executable file
@@ -0,0 +1,74 @@
ncnn openmp best practice

### CPU load average is too high with ncnn

When running neural network inference with ncnn, the cpu occupancy can be very high, with all CPU cores close to 100%.

If there are other threads or processes that need more cpu resources, the running speed of the program drops severely.

### The root cause of high CPU usage

1. ncnn uses the openmp API to speed up the inference computation. By default, the thread count equals the cpu core count. If the computation runs frequently, it consumes a lot of cpu resources.

2. There is a thread pool managed by openmp, and the pool size equals the cpu core count. (the maximum value is 15 if there are many more cpu cores?)
Openmp needs to synchronize threads when acquiring and returning them to the pool. To improve efficiency, almost all omp implementations use spinlock synchronization (except for simpleomp).
The default spin time of the spinlock is 200ms, so after a thread is scheduled, it may busy-wait for up to 200ms.

### Why the CPU usage is still high even when using vulkan GPU acceleration

1. Openmp is also used when loading the param/bin files, and this part runs on the cpu.

2. The fp32 to fp16 conversion before and after the GPU memory upload is executed on the cpu, and this part of the logic also uses openmp.

### Solution
```
1. Bind to specific cpu cores.
```
If you use a device with big and little CPU cores, it is recommended to bind big or little cores through ncnn::set_cpu_powersave(int). Note that Windows does not support binding cores. By the way, it is possible to have multiple thread pools with openmp; a new thread pool is created for a new thread scope.
Suppose your platform has 2 big cores + 4 little cores, and you want to execute model A on the 2 big cores and model B on the 4 little cores concurrently.

Create two threads via std::thread or pthread
```
void thread_1()
{
    ncnn::set_cpu_powersave(2); // bind to big cores
    netA.opt.num_threads = 2;
}

void thread_2()
{
    ncnn::set_cpu_powersave(1); // bind to little cores
    netB.opt.num_threads = 4;
}
```

```
2. Use fewer threads.
```
Set the number of threads to half of the cpu core count or less through ncnn::set_omp_num_threads(int), or change the net.opt.num_threads field. If you are building with clang's libomp, it is recommended that the number of threads does not exceed 8. If you use other omp libraries, it is recommended that the number of threads does not exceed 4.
```
3. Reduce the openmp spinlock blocktime.
```
You can modify the openmp blocktime by calling ncnn::set_kmp_blocktime(int) or by modifying the net.opt.openmp_blocktime field.
This argument is the spin time set by the ncnn API; the default is 20ms. You can set a smaller value according to the situation, or simply set it to 0.

Limitations: at present, only clang's libomp library implements this; neither vcomp nor libgomp has a corresponding interface.
If ncnn is not compiled with clang, this value is still 200ms by default.
If you use vcomp or libgomp, you can use the environment variable OMP_WAIT_POLICY=PASSIVE to disable the spin time. If you use simpleomp, there is no need to set this parameter.
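
A minimal sketch combining points 2 and 3; the blocktime setting only takes effect with clang's libomp, as noted above:

```cpp
ncnn::Net net;
net.opt.num_threads = 2;       // use fewer threads than the cpu core count
net.opt.openmp_blocktime = 0;  // do not spin-wait after parallel regions (clang libomp only)

// or set the process-wide defaults before creating nets
ncnn::set_omp_num_threads(2);
ncnn::set_kmp_blocktime(0);
```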
```
4. Limit the number of threads available in the openmp thread pool.
```
Even if the number of openmp threads is reduced, the CPU occupancy may still be high. This is more common on servers with particularly many CPU cores.
This is because the waiting threads in the thread pool busy-wait on a spinlock; the effect can be reduced by limiting the number of threads available in the thread pool.

Generally, you can set the OMP_THREAD_LIMIT environment variable. simpleomp currently does not support this feature, so there is no need to set it there.
Note that this environment variable is only effective if it is set before the program starts.
```
5. Disable openmp completely.
```
If there is only one cpu core, or you use vulkan gpu acceleration, it is recommended to disable openmp; just specify -DNCNN_OPENMP=OFF when compiling with cmake.
70 3rdparty/ncnn/docs/how-to-use-and-FAQ/openmp-best-practice.zh.md vendored Executable file
@@ -0,0 +1,70 @@
ncnn openmp best practice

### ncnn uses too many cpu resources

When running inference with ncnn, the cpu occupancy is very high, and even all cores may be close to 100%.

If other threads or processes need more cpu resources, the running speed drops severely.

### The root cause of the high cpu usage

1. ncnn uses the openmp API to control multi-threaded acceleration of the inference computation. By default, the number of threads equals the number of cpu cores. If inference needs to run at a high frequency, it will inevitably occupy most of the cpu resources.

2. openmp maintains an internal thread pool, whose maximum number of available threads equals the number of cpu cores. (the maximum limit is 15 when there are too many cores?) Synchronization is needed when acquiring and returning threads.

To improve efficiency, almost all omp implementations use spinlock synchronization (except simpleomp). The default spin time of the spinlock is 200ms, so after a thread is scheduled it may busy-wait for up to 200ms.

### Why the cpu usage is still high with vulkan acceleration

1. openmp is also used when loading the param files, and this part runs on the cpu.

2. The fp32/fp16 conversion before uploading to and after downloading from gpu memory is executed on the cpu, and this logic also uses openmp.

### Solution

```
1. Bind cores.
```
If you use a device with big and little cpu cores, it is recommended to bind big or little cores through ncnn::set_cpu_powersave(int); note that Windows does not support core binding. By the way, ncnn supports running different models on different cores. Suppose the hardware platform has 2 big cores and 4 little cores, and you want netA to run on the big cores and netB on the little cores.
You can create two threads via std::thread or pthread and run the following code:

```
void thread_1()
{
    ncnn::set_cpu_powersave(2); // bind to big cores
    netA.opt.num_threads = 2;
}

void thread_2()
{
    ncnn::set_cpu_powersave(1); // bind to little cores
    netB.opt.num_threads = 4;
}
```

```
2. Use fewer threads.
```
Set the number of threads to half the number of cpu cores or less through ncnn::set_omp_num_threads(int) or the net.opt.num_threads field. With clang's libomp, it is recommended not to exceed 8 threads; with other omp libraries, it is recommended not to exceed 4.
```
3. Reduce the openmp blocktime.
```
You can call ncnn::set_kmp_blocktime(int) or modify net.opt.openmp_blocktime. This parameter is the spin time set by the ncnn API; the default is 20ms.
You can set a smaller value as appropriate, or simply set it to 0.

Limitation: currently only clang's libomp implements this; vcomp and libgomp have no corresponding interface. If ncnn is not compiled with clang, this value is still 200ms by default.
If you use vcomp or libgomp, you can use the environment variable OMP_WAIT_POLICY=PASSIVE to disable the spin time. If you use simpleomp, there is no need to set this parameter.
```
4. Limit the number of threads available in the openmp thread pool.
```
Even if the number of openmp threads is reduced, the cpu occupancy may still be high. This is more common on servers with particularly many cpu cores. It is caused by the waiting threads in the thread pool busy-waiting on a spinlock; the effect can be mitigated by limiting the number of threads available in the pool.

Generally this can be done by setting the OMP_THREAD_LIMIT environment variable. simpleomp currently does not support this feature and does not need it. Note that this environment variable only takes effect if it is set before the program starts.
```
5. Disable openmp completely
```
If there is only one cpu core, or vulkan acceleration is used, it is recommended to disable openmp; just specify -DNCNN_OPENMP=OFF when compiling with cmake.
71 3rdparty/ncnn/docs/how-to-use-and-FAQ/quantized-int8-inference.md vendored Normal file
@@ -0,0 +1,71 @@
# Post Training Quantization Tools

To support int8 model deployment on mobile devices, we provide universal post-training quantization tools which can convert a float32 model to an int8 model.

## User Guide

Example with mobilenet; it only takes three steps.

### 1. Optimize model

```shell
./ncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 0
```

### 2. Create the calibration table file

We suggest using a verification dataset of more than 5000 images for calibration.

Some imagenet sample images are available at https://github.com/nihui/imagenet-sample-images

```shell
find images/ -type f > imagelist.txt
./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist.txt mobilenet.table mean=[104,117,123] norm=[0.017,0.017,0.017] shape=[224,224,3] pixel=BGR thread=8 method=kl
```

* mean and norm are the values you passed to ```Mat::substract_mean_normalize()```
* shape is the input blob shape of your model, [w,h] or [w,h,c]

>
* if w and h are both given, the image will be resized to exactly that size
* if w and h are both zero or negative, the image will not be resized
* if only h is zero or negative, the image's width will be scaled to w, keeping the aspect ratio
* if only w is zero or negative, the image's height will be scaled to h, keeping the aspect ratio

* pixel is the pixel format of your model; image pixels will be converted to this type before ```Extractor::input()```
* thread is the CPU thread count that can be used for parallel inference
* method is the post-training quantization algorithm; kl and aciq are currently supported

If your model has multiple input nodes, you can use multiple list files and multiple sets of parameters

```shell
./ncnn2table mobilenet-opt.param mobilenet-opt.bin imagelist-bgr.txt,imagelist-depth.txt mobilenet.table mean=[104,117,123],[128] norm=[0.017,0.017,0.017],[0.0078125] shape=[224,224,3],[224,224,1] pixel=BGR,GRAY thread=8 method=kl
```

### 3. Quantize model

```shell
./ncnn2int8 mobilenet-opt.param mobilenet-opt.bin mobilenet-int8.param mobilenet-int8.bin mobilenet.table
```

## use ncnn int8 inference

The ncnn library will use int8 inference automatically; nothing changes in your code

```cpp
ncnn::Net mobilenet;
mobilenet.load_param("mobilenet-int8.param");
mobilenet.load_model("mobilenet-int8.bin");
```

## mixed precision inference

Before quantizing your model, comment out a layer's weight scale line in the table file, and that layer will then do float32 inference

```
conv1_param_0 156.639840536
```

```
#conv1_param_0 156.639840536
```
162 3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.md vendored Normal file
@@ -0,0 +1,162 @@
We use alexnet as an example

### prepare caffe prototxt and model

These files are usually generated when training with caffe
```
train.prototxt
deploy.prototxt
snapshot_10000.caffemodel
```
deploy.prototxt and the caffemodel file are enough for the TEST phase

alexnet deploy.prototxt can be downloaded here

https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet

alexnet caffemodel can be downloaded here

http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel

### convert to ncnn model

Convert the old caffe prototxt and caffemodel to new ones using the tools in caffe,

because the ncnn convert tool needs the new format
```
upgrade_net_proto_text [old prototxt] [new prototxt]
upgrade_net_proto_binary [old caffemodel] [new caffemodel]
```

Use an Input layer as input, and set the N dim to 1, since only one image can be processed at a time
```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```
Use the caffe2ncnn tool to convert the caffe model to an ncnn model
```
caffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin
```

### strip visible string

The param and bin files alone are already enough for deployment, but there are visible strings in the param file, and it may not be suitable to distribute plain neural network information in your APP.

You can use the ncnn2mem tool to convert the plain model file to a binary representation. It will generate alexnet.param.bin and two static array code files.
```
ncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h
```

### load model

Load the param and bin files, the easy way
```cpp
ncnn::Net net;
net.load_param("alexnet.param");
net.load_model("alexnet.bin");
```
Load the binary param.bin and bin files; no visible strings included, suitable for bundling as an APP resource
```cpp
ncnn::Net net;
net.load_param_bin("alexnet.param.bin");
net.load_model("alexnet.bin");
```
Load the network and model from external memory; no visible strings included, no external resource files bundled, the whole model is hardcoded in your program

You may use this way to load from an android asset resource
```cpp
#include "alexnet.mem.h"
ncnn::Net net;
net.load_param(alexnet_param_bin);
net.load_model(alexnet_bin);
```
You can choose either way to load the model. Loading from external memory is zero-copy, which means you must keep your memory buffer alive during processing

### unload model
```cpp
net.clear();
```

### input and output

ncnn Mat is the data structure for input and output data

The input image should be converted to a Mat, with mean values subtracted and normalization applied when needed

```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
ncnn::Mat in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);

const float mean_vals[3] = {104.f, 117.f, 123.f};
in.substract_mean_normalize(mean_vals, 0);
```
Execute the network inference and retrieve the result
```cpp
#include "net.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input("data", in);
ex.extract("prob", out);
```
If you load the model with the binary param.bin file, you should use the enum values in the alexnet.id.h file instead of the blob names
```cpp
#include "net.h"
#include "alexnet.id.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input(alexnet_param_id::BLOB_data, in);
ex.extract(alexnet_param_id::BLOB_prob, out);
```
Read the data in the output Mat. Iterate over the data to get all classification scores.
```cpp
ncnn::Mat out_flattened = out.reshape(out.w * out.h * out.c);
std::vector<float> scores;
scores.resize(out_flattened.w);
for (int j=0; j<out_flattened.w; j++)
{
    scores[j] = out_flattened[j];
}
```
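For instance, here is a brief sketch of picking the top-scoring class from those scores:
```cpp
// find the class index with the highest score (needs <cfloat> and <cstdio>)
int top_class = 0;
float top_score = -FLT_MAX;
for (size_t i = 0; i < scores.size(); i++)
{
    if (scores[i] > top_score)
    {
        top_score = scores[i];
        top_class = (int)i;
    }
}
fprintf(stderr, "top class = %d  score = %f\n", top_class, top_score);
```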

### some tricks

Set the multithreading thread count on the Extractor
```cpp
ex.set_num_threads(4);
```
Convert image colorspace and resize the image with the Mat convenience functions; these functions are well optimized

RGB2GRAY, GRAY2RGB, RGB2BGR etc. are supported, as well as scaling up and down
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
int target_width = 227;// target resized width
int target_height = 227;// target resized height
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);
```
You can concat multiple model files into one and load this single file through the FILE* interface.

It should ease the distribution of param and model files.

> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin

```cpp
#include "net.h"
FILE* fp = fopen("alexnet-all.bin", "rb");
net.load_param_bin(fp);
net.load_model(fp);
fclose(fp);
```
149 3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnn-with-alexnet.zh.md vendored Normal file
@@ -0,0 +1,149 @@
First of all, many thanks for your interest in the ncnn component.
To make ncnn easier to use, the author wrote this guide, using the ubiquitous alexnet as an example.

### prepare the caffe network and model

The caffe network and model are usually trained by deep learning researchers; after training you generally get
```
train.prototxt
deploy.prototxt
snapshot_10000.caffemodel
```
Deployment only needs the TEST phase, so deploy.prototxt and the caffemodel are enough.

The alexnet deploy.prototxt can be downloaded here
https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet

The alexnet caffemodel can be downloaded here
http://dl.caffe.berkeleyvision.org/bvlc_alexnet.caffemodel

### convert the ncnn network and model

caffe ships with tools to convert old-format caffe networks and models to the new format (the ncnn tool only understands the new format)
```
upgrade_net_proto_text [old prototxt] [new prototxt]
upgrade_net_proto_binary [old caffemodel] [new caffemodel]
```
Change the input layer to Input; since only one image is processed at a time, set the first dim to 1
```
layer {
  name: "data"
  type: "Input"
  top: "data"
  input_param { shape: { dim: 1 dim: 3 dim: 227 dim: 227 } }
}
```
Use the caffe2ncnn tool to convert to the ncnn network description and model
```
caffe2ncnn deploy.prototxt bvlc_alexnet.caffemodel alexnet.param alexnet.bin
```
### strip visible strings

The param and bin files are already usable, but the param description file is plain text; if it is shipped in an APP, the network structure can easily be inspected (as if not being plain text would make it invisible...)
Use the ncnn2mem tool to convert them into a binary description file and an in-memory model; it generates alexnet.param.bin and two static-array code files
```
ncnn2mem alexnet.param alexnet.bin alexnet.id.h alexnet.mem.h
```
### load the model

Load param and bin directly; good for quickly verifying results
```cpp
ncnn::Net net;
net.load_param("alexnet.param");
net.load_model("alexnet.bin");
```
Load the binary param.bin and bin; no visible strings, suitable for shipping model resources in an APP
```cpp
ncnn::Net net;
net.load_param_bin("alexnet.param.bin");
net.load_model("alexnet.bin");
```
Load the network and model from a memory reference; no visible strings, the model data lives entirely in the code, no external files
Also, resource files packed into an android apk are read out as memory blocks
```cpp
#include "alexnet.mem.h"
ncnn::Net net;
net.load_param(alexnet_param_bin);
net.load_model(alexnet_bin);
```
All three ways can load the model. Loading from a memory reference is zero-copy, so the memory block the net is loaded from must stay alive while the net is used

### unload the model
```cpp
net.clear();
```

### input and output

ncnn uses its own data structure, Mat, to store input and output data
The input image data must be converted to a Mat, with mean subtraction and scaling applied as needed
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
ncnn::Mat in = ncnn::Mat::from_pixels(rgbdata, ncnn::Mat::PIXEL_RGB, w, h);

const float mean_vals[3] = {104.f, 117.f, 123.f};
in.substract_mean_normalize(mean_vals, 0);
```
Run the forward network and obtain the result
```cpp
#include "net.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input("data", in);
ex.extract("prob", out);
```
If you use the binary param.bin, there are no visible strings; use the enums in alexnet.id.h instead of the blob names
```cpp
#include "net.h"
#include "alexnet.id.h"
ncnn::Mat in;// input blob as above
ncnn::Mat out;
ncnn::Extractor ex = net.create_extractor();
ex.set_light_mode(true);
ex.input(alexnet_param_id::BLOB_data, in);
ex.extract(alexnet_param_id::BLOB_prob, out);
```
Read the output data from the Mat. The data inside a Mat is usually three-dimensional, c / h / w; iterate over it to get all classification scores
```cpp
ncnn::Mat out_flattened = out.reshape(out.w * out.h * out.c);
std::vector<float> scores;
scores.resize(out_flattened.w);
for (int j=0; j<out_flattened.w; j++)
{
    scores[j] = out_flattened[j];
}
```
### some usage tricks

The Extractor has a multithreading switch; setting the thread count can speed up computation
```cpp
ex.set_num_threads(4);
```
When converting an image to a Mat, you can convert the colorspace and resize it at the same time; these incidental operations are also optimized
Common conversions such as RGB2GRAY, GRAY2RGB and RGB2BGR are supported, as well as scaling down and up
```cpp
#include "mat.h"
unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
int target_width = 227;// target resized width
int target_height = 227;// target resized height
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB2GRAY, w, h, target_width, target_height);
```
Net has an interface to load from a FILE* descriptor, which you can use to merge multiple network and model files into one for easier distribution; it doesn't matter for the memory-reference way

> $ cat alexnet.param.bin alexnet.bin > alexnet-all.bin

```cpp
#include "net.h"
FILE* fp = fopen("alexnet-all.bin", "rb");
net.load_param_bin(fp);
net.load_model(fp);
fclose(fp);
```
135 3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnn-with-opencv.md vendored Normal file
@@ -0,0 +1,135 @@

### opencv to ncnn

* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + swap RGB/BGR

```cpp
// cv::Mat a(h, w, CV_8UC3);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB, a.cols, a.rows);
```

* cv::Mat CV_8UC3 -> ncnn::Mat 3 channel + keep RGB/BGR order

```cpp
// cv::Mat a(h, w, CV_8UC3);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_RGB, a.cols, a.rows);
```

* cv::Mat CV_8UC3 -> ncnn::Mat 1 channel + do RGB2GRAY/BGR2GRAY

```cpp
// cv::Mat rgb(h, w, CV_8UC3);
ncnn::Mat inrgb = ncnn::Mat::from_pixels(rgb.data, ncnn::Mat::PIXEL_RGB2GRAY, rgb.cols, rgb.rows);

// cv::Mat bgr(h, w, CV_8UC3);
ncnn::Mat inbgr = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2GRAY, bgr.cols, bgr.rows);
```

* cv::Mat CV_8UC1 -> ncnn::Mat 1 channel

```cpp
// cv::Mat a(h, w, CV_8UC1);
ncnn::Mat in = ncnn::Mat::from_pixels(a.data, ncnn::Mat::PIXEL_GRAY, a.cols, a.rows);
```

* cv::Mat CV_32FC1 -> ncnn::Mat 1 channel

* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**

```cpp
// cv::Mat a(h, w, CV_32FC1);
ncnn::Mat in(a.cols, a.rows, 1, (void*)a.data);
in = in.clone();
```

* cv::Mat CV_32FC3 -> ncnn::Mat 3 channel

* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**

```cpp
// cv::Mat a(h, w, CV_32FC3);
ncnn::Mat in_pack3(a.cols, a.rows, 1, (void*)a.data, (size_t)4u * 3, 3);
ncnn::Mat in;
ncnn::convert_packing(in_pack3, in, 1);
```

* std::vector < cv::Mat > + CV_32FC1 -> ncnn::Mat multiple channels

* **You could construct ncnn::Mat and fill data into it directly to avoid data copy**

```cpp
// std::vector<cv::Mat> a(channels, cv::Mat(h, w, CV_32FC1));
int channels = a.size();
ncnn::Mat in(a[0].cols, a[0].rows, channels);
for (int p=0; p<in.c; p++)
{
    memcpy(in.channel(p), (const uchar*)a[p].data, in.w * in.h * sizeof(float));
}
```

### ncnn to opencv

* ncnn::Mat 3 channel -> cv::Mat CV_8UC3 + swap RGB/BGR

* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255 (see the sketch after these examples)**

```cpp
// ncnn::Mat in(w, h, 3);
cv::Mat a(in.h, in.w, CV_8UC3);
in.to_pixels(a.data, ncnn::Mat::PIXEL_BGR2RGB);
```

* ncnn::Mat 3 channel -> cv::Mat CV_8UC3 + keep RGB/BGR order

* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**

```cpp
// ncnn::Mat in(w, h, 3);
cv::Mat a(in.h, in.w, CV_8UC3);
in.to_pixels(a.data, ncnn::Mat::PIXEL_RGB);
```

* ncnn::Mat 1 channel -> cv::Mat CV_8UC1

* **You may need to call in.substract_mean_normalize() first to scale values from 0..1 to 0..255**

```cpp
// ncnn::Mat in(w, h, 1);
cv::Mat a(in.h, in.w, CV_8UC1);
in.to_pixels(a.data, ncnn::Mat::PIXEL_GRAY);
```

* ncnn::Mat 1 channel -> cv::Mat CV_32FC1

* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**

```cpp
// ncnn::Mat in;
cv::Mat a(in.h, in.w, CV_32FC1);
memcpy((uchar*)a.data, in.data, in.w * in.h * sizeof(float));
```

* ncnn::Mat 3 channel -> cv::Mat CV_32FC3

* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**

```cpp
// ncnn::Mat in(w, h, 3);
ncnn::Mat in_pack3;
ncnn::convert_packing(in, in_pack3, 3);
cv::Mat a(in.h, in.w, CV_32FC3);
memcpy((uchar*)a.data, in_pack3.data, in.w * in.h * 3 * sizeof(float));
```

* ncnn::Mat multiple channels -> std::vector < cv::Mat > + CV_32FC1

* **You could consume or manipulate ncnn::Mat data directly to avoid data copy**

```cpp
// ncnn::Mat in(w, h, channels);
std::vector<cv::Mat> a(in.c);
for (int p=0; p<in.c; p++)
{
    a[p] = cv::Mat(in.h, in.w, CV_32FC1);
    memcpy((uchar*)a[p].data, in.channel(p), in.w * in.h * sizeof(float));
}
```
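
The notes above mention `in.substract_mean_normalize()` without showing its arguments; here is a minimal sketch of the two common uses. The mean/norm values are placeholders, not taken from this document; substitute the ones from your own training config.

```cpp
// cv::Mat bgr(h, w, CV_8UC3);  e.g. loaded with cv::imread
ncnn::Mat in = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR, bgr.cols, bgr.rows);

// preprocessing before inference: (x - mean) * norm, per channel
const float mean_vals[3] = {104.f, 117.f, 123.f};       // placeholder values
const float norm_vals[3] = {1/255.f, 1/255.f, 1/255.f}; // placeholder values
in.substract_mean_normalize(mean_vals, norm_vals);

// before to_pixels: scale a 0..1 output blob back to 0..255
const float zero_mean[3] = {0.f, 0.f, 0.f};
const float scale_255[3] = {255.f, 255.f, 255.f};
in.substract_mean_normalize(zero_mean, scale_255);
```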
48
3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnn-with-own-project.md
vendored
Normal file
@@ -0,0 +1,48 @@

### use ncnn with own project

After building ncnn, one or more library files are generated. To integrate ncnn into your own project, you can either use the cmake config file provided by the ncnn installation, or manually specify the library path(s).

**with cmake**

Ensure your project is built with cmake. Then in your project's CMakeLists.txt, add these lines:

```cmake
set(ncnn_DIR "<ncnn_install_dir>/lib/cmake/ncnn" CACHE PATH "Directory that contains ncnnConfig.cmake")
find_package(ncnn REQUIRED)
target_link_libraries(my_target ncnn)
```

After this, both the header search path (include directories) and the library paths are configured automatically, including the vulkan related dependencies.

Note: you have to change `<ncnn_install_dir>` to the directory on your machine that contains `ncnnConfig.cmake`.

For the prebuilt ncnn release packages, ncnnConfig is located in:
- for `ncnn-YYYYMMDD-windows-vs2019`, it is `lib/cmake/ncnn`
- for `ncnn-YYYYMMDD-android-vulkan`, it is `${ANDROID_ABI}/lib/cmake/ncnn` (`${ANDROID_ABI}` is defined in the NDK's cmake toolchain file)
- the other prebuilt release packages follow a similar layout

**manually specify**

You may also manually specify the ncnn library path and include directory. Note that if you use ncnn with vulkan, you also have to specify the vulkan related dependencies.

For example, on Visual Studio in debug mode with vulkan required, the lib paths are:
```
E:\github\ncnn\build\vs2019-x64\install\lib\ncnnd.lib
E:\lib\VulkanSDK\1.2.148.0\Lib\vulkan-1.lib
E:\github\ncnn\build\vs2019-x64\install\lib\SPIRVd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\glslangd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\MachineIndependentd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OGLCompilerd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OSDependentd.lib
E:\github\ncnn\build\vs2019-x64\install\lib\GenericCodeGend.lib
```
And for its release mode, the lib paths are:
```
E:\github\ncnn\build\vs2019-x64\install\lib\ncnn.lib
E:\lib\VulkanSDK\1.2.148.0\Lib\vulkan-1.lib
E:\github\ncnn\build\vs2019-x64\install\lib\SPIRV.lib
E:\github\ncnn\build\vs2019-x64\install\lib\glslang.lib
E:\github\ncnn\build\vs2019-x64\install\lib\MachineIndependent.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OGLCompiler.lib
E:\github\ncnn\build\vs2019-x64\install\lib\OSDependent.lib
E:\github\ncnn\build\vs2019-x64\install\lib\GenericCodeGen.lib
```
55
3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnn-with-pytorch-or-onnx.md
vendored
Normal file
@@ -0,0 +1,55 @@

Here is a practical guide for converting a pytorch model to ncnn

resnet18 is used as the example

## pytorch to onnx

The official pytorch tutorial for exporting an onnx model:

https://pytorch.org/tutorials/advanced/super_resolution_with_caffe2.html

```python
import torch
import torchvision
import torch.onnx

# An instance of your model
model = torchvision.models.resnet18()

# An example input you would normally provide to your model's forward() method
x = torch.rand(1, 3, 224, 224)

# Export the model
torch_out = torch.onnx._export(model, x, "resnet18.onnx", export_params=True)
```

## simplify onnx model

The exported resnet18.onnx model may contain many redundant operators such as Shape, Gather and Unsqueeze that are not supported in ncnn

```
Shape not supported yet!
Gather not supported yet!
# axis=0
Unsqueeze not supported yet!
# axes 7
Unsqueeze not supported yet!
# axes 7
```

Fortunately, daquexian developed a handy tool to eliminate them. cheers!

https://github.com/daquexian/onnx-simplifier

```
python3 -m onnxsim resnet18.onnx resnet18-sim.onnx
```

## onnx to ncnn

Finally, you can convert the model to ncnn using tools/onnx2ncnn

```
onnx2ncnn resnet18-sim.onnx resnet18.param resnet18.bin
```
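
To sanity-check the converted files, the usual ncnn loading and inference code applies; a minimal sketch (the input/output blob names below are assumptions, read the actual names from the generated resnet18.param):

```cpp
#include "net.h"

ncnn::Net net;
net.load_param("resnet18.param");
net.load_model("resnet18.bin");

unsigned char* rgbdata;// data pointer to RGB image pixels
int w;// image width
int h;// image height
ncnn::Mat in = ncnn::Mat::from_pixels_resize(rgbdata, ncnn::Mat::PIXEL_RGB, w, h, 224, 224);

ncnn::Extractor ex = net.create_extractor();
ex.input("input", in);     // blob name is an assumption
ncnn::Mat out;
ex.extract("output", out); // blob name is an assumption
```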
25
3rdparty/ncnn/docs/how-to-use-and-FAQ/use-ncnnoptimize-to-optimize-model.md
vendored
Normal file
@@ -0,0 +1,25 @@

the typical usage
```
ncnnoptimize mobilenet.param mobilenet.bin mobilenet-opt.param mobilenet-opt.bin 65536
```

operator fusion
* batchnorm - scale
* convolution - batchnorm
* convolutiondepthwise - batchnorm
* deconvolution - batchnorm
* deconvolutiondepthwise - batchnorm
* innerproduct - batchnorm
* convolution - relu
* convolutiondepthwise - relu
* deconvolution - relu
* deconvolutiondepthwise - relu
* innerproduct - relu

eliminate noop operator
* innerproduct - dropout
* flatten after global pooling

prefer better operator
* replace convolution with innerproduct after global pooling
173
3rdparty/ncnn/docs/how-to-use-and-FAQ/vulkan-notes.md
vendored
Normal file
@@ -0,0 +1,173 @@

## supported platform

* Y = known work
* ? = shall work, not confirmed
* / = not applied

| |windows|linux|android|mac|ios|
|---|---|---|---|---|---|
|intel|Y|Y|?|?|/|
|amd|Y|Y|/|?|/|
|nvidia|Y|Y|?|/|/|
|qcom|/|/|Y|/|/|
|apple|/|/|/|?|Y|
|arm|/|?|?|/|/|

## enable vulkan compute support
```
$ sudo dnf install vulkan-devel
$ cmake -DNCNN_VULKAN=ON ..
```

## enable vulkan compute inference
```cpp
ncnn::Net net;
net.opt.use_vulkan_compute = 1;
```
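
Set this option before `load_param` / `load_model`; the rest of the inference code stays the same. A minimal sketch reusing the alexnet example from earlier:

```cpp
#include "net.h"

ncnn::Net net;
net.opt.use_vulkan_compute = 1; // set before loading param and model

net.load_param("alexnet.param");
net.load_model("alexnet.bin");

ncnn::Mat in; // input blob, prepared as usual
ncnn::Mat out;

ncnn::Extractor ex = net.create_extractor();
ex.input("data", in);
ex.extract("prob", out); // runs through the vulkan path when enabled
```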

## proper allocator usage
```cpp
ncnn::VkAllocator* blob_vkallocator = vkdev.acquire_blob_allocator();
ncnn::VkAllocator* staging_vkallocator = vkdev.acquire_staging_allocator();

net.opt.blob_vkallocator = blob_vkallocator;
net.opt.workspace_vkallocator = blob_vkallocator;
net.opt.staging_vkallocator = staging_vkallocator;

// ....

// after inference
vkdev.reclaim_blob_allocator(blob_vkallocator);
vkdev.reclaim_staging_allocator(staging_vkallocator);
```

## select gpu device
```cpp
// get gpu count
int gpu_count = ncnn::get_gpu_count();

// set specified vulkan device before loading param and model
net.set_vulkan_device(0); // use device-0
net.set_vulkan_device(1); // use device-1
```

## zero-copy on unified memory device
```cpp
ncnn::VkMat blob_gpu;
ncnn::Mat mapped = blob_gpu.mapped();

// use mapped.data directly
```

## hybrid cpu/gpu inference
```cpp
ncnn::Extractor ex_cpu = net.create_extractor();
ncnn::Extractor ex_gpu = net.create_extractor();
ex_cpu.set_vulkan_compute(false);
ex_gpu.set_vulkan_compute(true);

#pragma omp parallel sections
{
#pragma omp section
{
    // blob names and Mats are placeholders for your own model
    ex_cpu.input("data", in_cpu);
    ex_cpu.extract("prob", out_cpu);
}
#pragma omp section
{
    ex_gpu.input("data", in_gpu);
    ex_gpu.extract("prob", out_gpu);
}
}
```

## zero-copy gpu inference chaining
```cpp
ncnn::Extractor ex1 = net1.create_extractor();
ncnn::Extractor ex2 = net2.create_extractor();

ncnn::VkCompute cmd(&vkdev);

ncnn::VkMat conv1;
ncnn::VkMat conv2;
ncnn::VkMat conv3;

ex1.input("conv1", conv1);
ex1.extract("conv2", conv2, cmd);

ex2.input("conv2", conv2);
ex2.extract("conv3", conv3, cmd);

cmd.submit();

cmd.wait();
```

## batch inference
```cpp
int max_batch_size = vkdev->info.compute_queue_count;

ncnn::Mat inputs[1000];
ncnn::Mat outputs[1000];

#pragma omp parallel for num_threads(max_batch_size)
for (int i=0; i<1000; i++)
{
    ncnn::Extractor ex = net1.create_extractor();
    ex.input("data", inputs[i]);
    ex.extract("prob", outputs[i]);
}
```

## control storage and arithmetic precision

disable all lower-precision optimizations, get full fp32 precision

```cpp
ncnn::Net net;
net.opt.use_fp16_packed = false;
net.opt.use_fp16_storage = false;
net.opt.use_fp16_arithmetic = false;
net.opt.use_int8_storage = false;
net.opt.use_int8_arithmetic = false;
```

## debugging tips
```cpp
#define ENABLE_VALIDATION_LAYER 1 // modify to 1 in gpu.cpp
```

## add vulkan compute support to layer
1. add vulkan shader in src/layer/shader/

2. upload model weight data in Layer::upload_model()

3. setup pipeline in Layer::create_pipeline()

4. destroy pipeline in Layer::destroy_pipeline()

5. record command in Layer::forward()

## add optimized shader path
1. add vulkan shader in src/layer/shader/ named XXX_abc.comp

2. create pipeline with "XXX_abc"

3. record command using XXX_abc pipeline

## low-level op api
1. create layer

2. load param and load model

3. upload model

4. create pipeline

5. new command

6. record

7. submit and wait