# Training and Test
## Training
### Training with your PC
You can use `tools/train.py` to train a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script:

```shell
python tools/train.py ${CONFIG_FILE} [ARGS]
```
Note

By default, MMAction2 prefers GPU to CPU. If you want to train a model on CPU, set `CUDA_VISIBLE_DEVICES` to an empty value or `-1` to make GPUs invisible to the program:

```shell
CUDA_VISIBLE_DEVICES=-1 python tools/train.py ${CONFIG_FILE} [ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `--work-dir WORK_DIR` | The target folder to save logs and checkpoints. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--resume [RESUME]` | Resume training. If a path is specified, resume from it; if not specified, try to auto-resume from the latest checkpoint. |
| `--amp` | Enable automatic mixed precision training. |
| `--no-validate` | Not suggested. Disable checkpoint evaluation during training. |
| `--auto-scale-lr` | Auto-scale the learning rate according to the actual batch size and the original batch size. |
| `--seed` | Random seed. |
| `--diff-rank-seed` | Whether to set different seeds for different ranks. |
| `--deterministic` | Whether to set deterministic options for the cuDNN backend. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no whitespace is allowed. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. Defaults to `none`. |
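For example, a minimal sketch of a single-machine run with mixed precision and a command-line config override (the config path, work directory, and overridden batch size below are illustrative placeholders, not values from this guide):

```shell
# Illustrative: the config path, work dir, and batch size are placeholders.
python tools/train.py configs/recognition/tsn/my_tsn_config.py \
    --work-dir work_dirs/my_tsn_run \
    --amp \
    --seed 0 \
    --cfg-options train_dataloader.batch_size=16
```

Here `--cfg-options` changes the dataloader's batch size without editing the config file, assuming the config defines a standard `train_dataloader` section.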
### Training with multiple GPUs
We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG` | The path to the config file. |
| `GPUS` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see here. |
You can also specify extra arguments of the launcher via environment variables. For example, change the communication port of the launcher to 29666 with the following command:

```shell
PORT=29666 bash tools/dist_train.sh ${CONFIG} ${GPUS} [PY_ARGS]
```
If you want to start multiple training jobs that use different GPUs, launch them with different ports and visible devices:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_train.sh ${CONFIG} 4 [PY_ARGS]
```
### Training with multiple machines
#### Multiple machines in the same network
If you launch a training job with multiple machines connected via Ethernet, you can run the following commands.

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_train.sh $CONFIG $GPUS
```
The following extra environment variables need to be specified to train or test models with multiple machines:
| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. Defaults to 1. |
| `NODE_RANK` | The index of the local machine. Defaults to 0. |
| `PORT` | The communication port; it should be the same on all machines. Defaults to 29500. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. Defaults to `127.0.0.1`. |
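As a concrete sketch, a two-node training run with 8 GPUs per node might look like this (the master IP `10.1.1.1`, the config path, and the GPU count are placeholders):

```shell
# On the master machine (rank 0); 10.1.1.1 is a placeholder IP.
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_train.sh configs/my_config.py 8

# On the second machine (rank 1); PORT and MASTER_ADDR must match the master's.
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_train.sh configs/my_config.py 8
```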
Training is usually slow if you do not have high-speed networking such as InfiniBand.
#### Multiple machines managed with Slurm
If you run MMAction2 on a cluster managed with Slurm, you can use the script `slurm_train.sh`.

```shell
[ENV_VARS] bash tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG} [PY_ARGS]
```
Here is a description of the script's arguments:
| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG` | The path to the config file. |
| `[PY_ARGS]` | The other optional arguments of `tools/train.py`, see here. |
Here are the environment variables that can be used to configure the Slurm job:
| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found here. |
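For example, a sketch of a 16-GPU training job spanning two nodes (the partition name `gpu_part`, the job name, and the config path are placeholders):

```shell
# Illustrative Slurm job: partition, job name, and config path are placeholders.
GPUS=16 GPUS_PER_NODE=8 CPUS_PER_TASK=5 bash tools/slurm_train.sh gpu_part train_demo configs/my_config.py
```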
## Test
### Test with your PC
You can use `tools/test.py` to test a model on a single machine with a CPU and optionally a GPU. Here is the full usage of the script:

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```
Note

By default, MMAction2 prefers GPU to CPU. If you want to test a model on CPU, set `CUDA_VISIBLE_DEVICES` to an empty value or `-1` to make GPUs invisible to the program:

```shell
CUDA_VISIBLE_DEVICES=-1 python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG_FILE` | The path to the config file. |
| `CHECKPOINT_FILE` | The path to the checkpoint file (it can be an HTTP link). |
| `--work-dir WORK_DIR` | The directory to save the file containing evaluation metrics. Defaults to a folder with the same name as the config file under `./work_dirs`. |
| `--dump DUMP` | The path to dump all outputs of the model for offline evaluation. |
| `--cfg-options CFG_OPTIONS` | Override some settings in the used config; key-value pairs in `xxx=yyy` format will be merged into the config file. If the value to be overwritten is a list, it should be of the form `key="[a,b]"` or `key=a,b`. The argument also allows nested list/tuple values, e.g. `key="[(a,b),(c,d)]"`. Note that the quotation marks are necessary and that no whitespace is allowed. |
| `--show-dir SHOW_DIR` | The directory to save the result visualization images. |
| `--show` | Visualize the prediction results in a window. |
| `--interval INTERVAL` | The interval of samples to visualize. Defaults to 1. |
| `--wait-time WAIT_TIME` | The display time of each window (in seconds). Defaults to 2. |
| `--launcher {none,pytorch,slurm,mpi}` | Options for job launcher. Defaults to `none`. |
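For example, a sketch that evaluates a local checkpoint and dumps the raw predictions for offline evaluation (all paths are placeholders):

```shell
# Illustrative: the config, checkpoint, and output paths are placeholders.
python tools/test.py configs/my_config.py work_dirs/my_run/epoch_100.pth \
    --work-dir work_dirs/my_run_eval \
    --dump work_dirs/my_run_eval/predictions.pkl
```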
### Test with multiple GPUs
We provide a shell script to start a multi-GPU task with `torch.distributed.launch`.

```shell
bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```
| ARGS | Description |
| --- | --- |
| `CONFIG` | The path to the config file. |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link). |
| `GPUS` | The number of GPUs to be used. |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see here. |
You can also specify extra arguments of the launcher via environment variables. For example, change the communication port of the launcher to 29666 with the following command:

```shell
PORT=29666 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} ${GPUS} [PY_ARGS]
```
If you want to start multiple test jobs that use different GPUs, launch them with different ports and visible devices:

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 bash tools/dist_test.sh ${CONFIG} ${CHECKPOINT} 4 [PY_ARGS]
```
### Test with multiple machines
#### Multiple machines in the same network
If you launch a test job with multiple machines connected via Ethernet, you can run the following commands.

On the first machine:

```shell
NNODES=2 NODE_RANK=0 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```

On the second machine:

```shell
NNODES=2 NODE_RANK=1 PORT=$MASTER_PORT MASTER_ADDR=$MASTER_ADDR bash tools/dist_test.sh $CONFIG $CHECKPOINT $GPUS
```
Compared with multi-GPU testing on a single machine, you need to specify some extra environment variables:
| ENV_VARS | Description |
| --- | --- |
| `NNODES` | The total number of machines. Defaults to 1. |
| `NODE_RANK` | The index of the local machine. Defaults to 0. |
| `PORT` | The communication port; it should be the same on all machines. Defaults to 29500. |
| `MASTER_ADDR` | The IP address of the master machine; it should be the same on all machines. Defaults to `127.0.0.1`. |
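As a concrete sketch, a two-node test run with 8 GPUs per node (the master IP, config path, and checkpoint path are placeholders):

```shell
# On the master machine (rank 0); 10.1.1.1 and the paths are placeholders.
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_test.sh configs/my_config.py work_dirs/my_run/epoch_100.pth 8

# On the second machine (rank 1); PORT and MASTER_ADDR must match the master's.
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 bash tools/dist_test.sh configs/my_config.py work_dirs/my_run/epoch_100.pth 8
```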
Testing is usually slow if you do not have high-speed networking such as InfiniBand.
#### Multiple machines managed with Slurm
If you run MMAction2 on a cluster managed with Slurm, you can use the script `slurm_test.sh`.

```shell
[ENV_VARS] bash tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG} ${CHECKPOINT} [PY_ARGS]
```
Here is a description of the script's arguments:
| ARGS | Description |
| --- | --- |
| `PARTITION` | The partition to use in your cluster. |
| `JOB_NAME` | The name of your job; you can name it as you like. |
| `CONFIG` | The path to the config file. |
| `CHECKPOINT` | The path to the checkpoint file (it can be an HTTP link). |
| `[PY_ARGS]` | The other optional arguments of `tools/test.py`, see here. |
Here are the environment variables that can be used to configure the Slurm job:
| ENV_VARS | Description |
| --- | --- |
| `GPUS` | The number of GPUs to be used. Defaults to 8. |
| `GPUS_PER_NODE` | The number of GPUs to be allocated per node. Defaults to 8. |
| `CPUS_PER_TASK` | The number of CPUs to be allocated per task (usually one GPU corresponds to one task). Defaults to 5. |
| `SRUN_ARGS` | The other arguments of `srun`. Available options can be found here. |
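For example, a sketch of an 8-GPU test job on a single node (the partition name `gpu_part`, the job name, and the paths are placeholders):

```shell
# Illustrative Slurm test job: partition, job name, and paths are placeholders.
GPUS=8 GPUS_PER_NODE=8 bash tools/slurm_test.sh gpu_part test_demo configs/my_config.py work_dirs/my_run/epoch_100.pth
```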