# Training and Test
## Training
### Training with your PC
You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script:

```shell
python tools/train.py ${CONFIG_FILE} [ARGS]
```
Note

By default, MMAction2 prefers GPU to CPU. If you want to train a model on CPU, set `CUDA_VISIBLE_DEVICES` to an empty value or `-1` to make GPUs invisible to the program:

```shell
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--resume [RESUME]` | Resume training. If a path is specified, resume from it; if not specified, try to auto-resume from the latest checkpoint. |
| `--amp` | Enable automatic mixed precision training. |
| `--no-validate` | Not suggested. Disable checkpoint evaluation during training. |
| `--auto-scale-lr` | Auto-scale the learning rate according to the actual batch size and the original batch size. |
| `--seed` | Random seed. |
| `--diff-rank-seed` | Whether to set different seeds for different ranks. |
| `--deterministic` | Whether to set deterministic options for the cuDNN backend. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no whitespace is allowed. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. Defaults to `none`. |
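For example, a minimal sketch of a single-machine run with mixed precision and a command-line config override (the config path, work directory, and overridden batch size below are illustrative placeholders, not values from this guide):

```shell
# Illustrative: the config path, work dir, and batch size are placeholders.
python tools/train.py configs/recognition/tsn/my_tsn_config.py \
    --work-dir work_dirs/my_tsn_run \
    --amp \
    --seed 0 \
    --cfg-options train_dataloader.batch_size=16
```

Here `--cfg-options` changes the dataloader's batch size without editing the config file, assuming the config defines a standard `train_dataloader` section.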
### Training with multiple GPUs
We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG` | The path to the config file. |
| `GPUS` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see here. |
You can also specify extra arguments of the launcher via environment variables. For example, change the communication port of the launcher to 29666 with the following command:

```shell
PORT=29666 bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```
If you want to start multiple training jobs that use different GPUs, launch them with different ports and visible devices:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
```
### Training with multiple machines
#### Multiple machines in the same network
If you launch a training job with multiple machines connected via Ethernet, you can run the following commands.

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```
The following extra environment variables need to be specified to train or test models with multiple machines:
| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. Defaults to 1. |
| `NODE_RANK` | The index of the local machine. Defaults to 0. |
| `PORT` | The communication port; it should be the same on all machines. Defaults to 29500. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. Defaults to `127.0.0.1`. |
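As a concrete sketch, a two-node training run with 8 GPUs per node might look like this (the master IP `10.1.1.1`, the config path, and the GPU count are placeholders):

```shell
# On the master machine (rank 0); 10.1.1.1 is a placeholder IP.
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_train.sh configs/my_config.py 8

# On the second machine (rank 1); PORT and MASTER_ADDR must match the master's.
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_train.sh configs/my_config.py 8
```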
Training is usually slow if you do not have high-speed networking such as InfiniBand.
#### Multiple machines managed with Slurm
If you run MMAction2 on a cluster managed with Slurm, you can use the script `slurm_train.sh`.

```shell
[ENV_VARS] bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} [PY_ARGS]
```
Here is a description of the script's arguments:
| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG` | The path to the config file. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see here. |
Here are the environment variables that can be used to configure the Slurm job:
| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found here. |
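For example, a sketch of a 16-GPU training job spanning two nodes (the partition name `gpu_part`, the job name, and the config path are placeholders):

```shell
# Illustrative Slurm job: partition, job name, and config path are placeholders.
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 bash tools/slurm_train.sh gpu_part train_demo configs/my_config.py
```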
## Test
### Test with your PC
You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script:

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```
Note

By default, MMAction2 prefers GPU to CPU. If you want to test a model on CPU, set `CUDA_VISIBLE_DEVICES` to an empty value or `-1` to make GPUs invisible to the program:

```shell
CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `CHECKPOINT_FILE` | The path to the checkpoint file (it can be an HTTP link). |
| `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--dump DUMP` | The path to dump all outputs of the model for offline evaluation. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no whitespace is allowed. |
| `--show-dir SHOW_DIR` | The directory to save the result visualization images. |
| `--show` | Visualize the prediction results in a window. |
| `--interval INTERVAL` | The interval of samples to visualize. Defaults to 1. |
| `--wait-time WAIT_TIME` | The display time of each window (in seconds). Defaults to 2. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. Defaults to `none`. |
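For example, a sketch that evaluates a local checkpoint and dumps the raw predictions for offline evaluation (all paths are placeholders):

```shell
# Illustrative: the config, checkpoint, and output paths are placeholders.
python tools/test.py configs/my_config.py work_dirs/my_run/epoch_100.pth \
    --work-dir work_dirs/my_run_eval \
    --dump work_dirs/my_run_eval/predictions.pkl
```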
### Test with multiple GPUs
We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG` | The path to the config file. |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link). |
| `GPUS` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see here. |
You can also specify extra arguments of the launcher via environment variables. For example, change the communication port of the launcher to 29666 with the following command:

```shell
PORT=29666 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```
If you want to start multiple test jobs that use different GPUs, launch them with different ports and visible devices:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
```
### Test with multiple machines
#### Multiple machines in the same network
If you launch a test job with multiple machines connected via Ethernet, you can run the following commands.

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```
Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:
| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. Defaults to 1. |
| `NODE_RANK` | The index of the local machine. Defaults to 0. |
| `PORT` | The communication port; it should be the same on all machines. Defaults to 29500. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. Defaults to `127.0.0.1`. |
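As a concrete sketch, a two-node test run with 8 GPUs per node (the master IP, config path, and checkpoint path are placeholders):

```shell
# On the master machine (rank 0); 10.1.1.1 and the paths are placeholders.
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_test.sh configs/my_config.py work_dirs/my_run/epoch_100.pth 8

# On the second machine (rank 1); PORT and MASTER_ADDR must match the master's.
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_test.sh configs/my_config.py work_dirs/my_run/epoch_100.pth 8
```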
Testing is usually slow if you do not have high-speed networking such as InfiniBand.
#### Multiple machines managed with Slurm
If you run MMAction2 on a cluster managed with Slurm, you can use the script `slurm_test.sh`.

```shell
[ENV_VARS] bash tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${CHECKPOINT} [PY_ARGS]
```
Here is a description of the script's arguments:
| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG` | The path to the config file. |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link). |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see here. |
Here are the environment variables that can be used to configure the Slurm job:
| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found here. |
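For example, a sketch of an 8-GPU test job on a single node (the partition name `gpu_part`, the job name, and the paths are placeholders):

```shell
# Illustrative Slurm test job: partition, job name, and paths are placeholders.
GPUS=8 GPUS_PER_NODE=8 bash tools/slurm_test.sh gpu_part test_demo configs/my_config.py work_dirs/my_run/epoch_100.pth
```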