什么是TAO以及如何安装和使用TAO
NVIDIA的TAO(Train, Adapt, Optimize)Toolkit以前叫TLT(Tranfer Learning Toolkit),即迁移学习工具,关于什么是迁移学习,网上资料大把,这里只说TAO,故不多说迁移学习。只要是干过模型的,看看NVIDIA官网上给出的这张关于TAO的架构图,也能大概明白了七八分:近来琢磨熟悉TAO的使用并写PPT给组内培训,所以对TAO的相关文档仔细过了
NVIDIA的TAO(Train, Adapt, Optimize)Toolkit以前叫TLT(Tranfer Learning Toolkit),即迁移学习工具,关于什么是迁移学习,网上资料大把,这里只说TAO,故不多说迁移学习。只要是干过模型的,看看NVIDIA官网上给出的这张关于TAO的架构图,也能大概明白了七八分:
近来琢磨熟悉TAO的使用并写PPT给组内培训,所以对TAO的相关文档仔细过了一遍,也在不连贯的文档说明指导下琢磨了从TAO的环境安装到训练数据及预训练模型的准备和模型训练再到模型的导出成etlt文件和转换成TensorRT engine文件的全过程,感觉TAO配套的说明文档有点混乱不连贯,有部分内容还有点陈旧了,所以按照自己走通了的路写了个PPT,记录了全部过程。
TAO目前有三个docker image,参见相关说明:
TAO Toolkit for Conv AI | NVIDIA NGCDocker container with workflows implemented in PyTorch as part of the Train Adapt Optimize (TAO) Toolkit.https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/containers/tao-toolkit-pytTAO Toolkit for Language Model (Conv AI) | NVIDIA NGCDocker containing implementations to train a Language Model with TAO Toolkithttps://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/containers/tao-toolkit-lmTAO的预训练模型放在几个docker image里,我们可以看上面三个docker image的说明文档仔细看里面的模型列表,例如:
寻找自己需要的合适模型来在它的基础上做迁移学习。第一个docker image里收集的是CV相关的模型,安装的训练框架是tensorflow,第二个docker image里收集的是语音和文本相关的模型,安装的训练框架是pytorch,第三个里面是语言模型,这些docker image并不需要手工拉取并创建container,当在第一次执行tao <network> train ... 之类的命令时会自动拉取对应的docker image到本地并创建容器(创建容器时对容器内的路径和host上的路径映射都是定义在~/.tao_mounts.json里,所以需要手工创建这个文件,后面会提到),容器应该是用完即删,训练数据和预训练模型以及训练的结果模型等文件都是通过路径映射机制保存到host上,知道了这些并且对docker比较熟悉,对后面的步骤就能很快明白为何要那样做。
为了偷懒省点能量,下面很多步骤说明我就直接抠NVIDIA官网文档里的内容或者我自己写的PPT的内容来直接粘贴。
1.安装TAO
硬件软件要求:
安装TAO之前需要做的:
Login into nvcr.io docker:
root@ubuntu-rtx3090:/data/workspace# docker login nvcr.io
Username: $oauthtoken
Password: <input your API Key>
Once you input your API key and press ENTER, you will see the following messages:
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store
Login Succeeded
我们该安装哪个版本的TAO取决于我们使用的DeepStream的版本,假如由TAO里导出的模型需要在Deepstream中使用的话:
We create ~/.tao_mounts.json by the following steps on host (not in tao-* container!):
Set four environmental variables with your own paths and key, e.g:
export HOST_DATA_DIR=/data/workspace/tao-experiments/data
export HOST_SPECS_DIR=/data/workspace/tao-experiments/specs
export HOST_RESULTS_DIR=/data/workspace/tao-experiments/results
export KEY=nvidia_tao
For permanent use, add the above lines into ~/.bashrc
and execute ‘source ~/.bashrc’
2) Create the directories to make sure the paths exist:
mkdir -p $HOST_DATA_DIR
mkdir -p $HOST_SPECS_DIR
mkdir -p $HOST_RESULTS_DIR
安装一定步骤从官网上下载训练用的脚本和预训练模型:
Example Jupyter notebooks for all the tasks that are supported in TAO Toolkit are available in NGC resources. TAO Toolkit provides sample workflows for Computer Vision and Conversational AI.
As of now, we only care CV models, to run the available CV examples, download this sample resource by using the following commands:
wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/tao/cv_samples/versions/v1.3.0/zip -O cv_samples_v1.3.0.zip
unzip -u cv_samples_v1.3.0.zip -d ./cv_samples_v1.3.0 && rm -rf cv_samples_v1.3.0.zip && cd ./cv_samples_v1.3.0
You will see multi directories, under each directory, there is a *. ipynb file for training model in notebook.
# if you want to train actionrecognitionnet, copy the specs scripts for it to $HOST_SPECS_DIR:
cp -r action_recognition_net/specs/* $HOST_SPECS_DIR
List the files under $HOST_SPECS_DIR, you will see these file :
The actual steps are :
#download and configure ngc cli
wget -O ngccli_linux.zip https://ngc.nvidia.com/downloads/ngccli_linux.zip && unzip -o ngccli_linux.zip && chmod u+x ngc
#check if the file downloaded is OK or not:
md5sum -c ngc.md5
echo "export PATH=\"\$PATH:$(pwd)\"" >> ~/.bashsrc && source ~/.bashsrc
## browse the the model list for Transfer Learning:
#ngc registry model list nvidia/tao/*
# download model by command:
ngc registry model download-version <Repository:version> -dest <path_to_download_dir>
#e.g. download model to $HOST_RESULTS_DIR/pretrained/:
ngc registry model download-version nvidia/tao/actionrecognitionnet:trainable_v1.0 --dest $HOST_RESULTS_DIR/pretrained/
Now, preparation is done, we can run commands in terminal to do model training/testing or inference, or launch jupyter notebook under the directory cv_samples_v1.3.0 and run the commands in the .ipynb file.
Here I only present how to use jupyter notebook:
#run jupyter
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root
To access the notebook, open this file in a browser:
file:///root/.local/share/jupyter/runtime/nbserver-8943-open.html
Or copy and paste one of these URLs:
http://ubuntu-rtx3090:8888/?token=051c9f3f9bdffa585b75b22509b11714cd7c1f7670b4ac40
or http://127.0.0.1:8888/?token=051c9f3f9bdffa585b75b22509b11714cd7c1f7670b4ac40
copy the URL and open the page in browser, you will see:
2.训练数据准备
以action recognition使用的hmdb51 dataset为例:
3.使用TAO训练和测试模型以及模型推理
4.导出TAO模型以及模型的集成与转换
直接将TAO导出的etlt文件集成进DeepStream:
安装TAO Converter并使用TAO Converter将模型的etlt文件转换成TensorRT engine文件然后给Deepstream使用:
安装Tao Converter参见: https://docs.nvidia.com/tao/tao-toolkit/text/tensorrt.html#id2
TAO Converter Support Matrix for x86 | ||
CUDA/CUDNN | TensorRT | Platform |
10.2/8.0 | 7.2 | |
11.0/8.0 | 7.2 | |
11.1/8.0 | 7.2 | |
11.2/8.0 | 7.2 | |
10.2/8.0 | 7.1 | |
11.0/8.0 | 7.1 | |
11.3/8.1 | 8.0 |
TAO Converter Support Matrix for Jetson | ||
Platform | JetPack Version | Availability |
Jetson | 4.4 | |
Jetson | 4.5 | |
Jetson | 4.6 | |
Jetson + dGPU | 4.5 |
我们需要在Jetson板子上部署模型的话,就需要安装Jetson版的TAO Converter:
然后使用Tao Converter将etlt文件转换成TensorRT engine文件:
将actionrecognitionnet模型的etlt文件转换成engine的过程中出了错,等解决这个问题后再补充完善这最后一步,初步判断是低一点版本的TAO converter在使用-p 参数时不支持5维,待节后使用最新版的Jetpack4.6(使用的TensorRT8)里安装最新的TAO后验证这个猜测的正确性。
这个问题我调查后已经明确了,只有Jetpack4.6版的tao-converter的-p才支持5维,JP4.5或以下版本的tao-converter都不支持,这个问题很好fix,预计NVIDIA会在后续新版中推出Jetson上支持JP4.4和4.5的fixed版本。
开放原子开发者工作坊旨在鼓励更多人参与开源活动,与志同道合的开发者们相互交流开发经验、分享开发心得、获取前沿技术趋势。工作坊有多种形式的开发者活动,如meetup、训练营等,主打技术交流,干货满满,真诚地邀请各位开发者共同参与!
更多推荐
所有评论(0)