# Transformers Example with Ignite

In this example, we show how to use _Ignite_ to fine-tune a transformer model:

- train on one or more GPUs or TPUs
- compute training/validation metrics
- log the learning rate, metrics, etc.
- save the best model weights

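At its core, fine-tuning with Ignite means wrapping a training step in an `ignite.engine.Engine`. The sketch below illustrates the idea; the model name, toy data, and hyperparameters are illustrative placeholders, not the exact code of `main.py` (which also adds metrics, learning-rate logging, and checkpointing):

```python
# Minimal sketch of the Ignite fine-tuning loop (illustrative, not main.py).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from ignite.engine import Engine

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased").to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

def train_step(engine, batch):
    model.train()
    batch = {k: v.to(device) for k, v in batch.items()}
    # transformers models return the loss when `labels` are passed
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(train_step)

# Toy two-sentence dataset, just to make the sketch runnable end to end.
enc = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
enc["labels"] = torch.tensor([1, 0])
loader = DataLoader([{k: v[i] for k, v in enc.items()} for i in range(2)], batch_size=2)

trainer.run(loader, max_epochs=1)
```
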
Configurations:

- [x] single GPU
- [x] multiple GPUs on a single node
- [x] TPUs on Colab

## Requirements:

- pytorch-ignite: `pip install pytorch-ignite`
- [transformers](https://github.com/huggingface/transformers): `pip install transformers`
- [datasets](https://github.com/huggingface/datasets): `pip install datasets`
- [tqdm](https://github.com/tqdm/tqdm/): `pip install tqdm`
- [tensorboardX](https://github.com/lanpa/tensorboard-pytorch): `pip install tensorboardX`
- [python-fire](https://github.com/google/python-fire): `pip install fire`
- Optional: [clearml](https://github.com/allegroai/clearml): `pip install clearml`

Alternatively, install all the requirements using `pip install -r requirements.txt`.

## Usage:

Run the example on a single GPU:

```bash
python main.py run
```

If needed, adjust the batch size to fit your GPU with the `--batch_size` argument, e.g. `python main.py run --batch_size=16`.

For details on accepted arguments:

```bash
python main.py run -- --help
```

### Distributed training

#### Single node, multiple GPUs

Let's start training on a single node with 2 GPUs:

```bash
# using torch.distributed.launch
python -u -m torch.distributed.launch --nproc_per_node=2 --use_env main.py run --backend="nccl"
```

or

```bash
# using function spawn inside the code
python -u main.py run --backend="nccl" --nproc_per_node=2
```
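
Under the hood, the spawn-based variant relies on `ignite.distributed.Parallel`, which spawns `nproc_per_node` processes and runs a training function in each of them. Below is a minimal sketch of that pattern; the `training` function and its `config` argument are illustrative placeholders, not the exact code of `main.py`:

```python
# Sketch: spawning worker processes from inside the script with ignite.distributed.
import ignite.distributed as idist


def training(local_rank, config):
    # each spawned process executes this function
    rank = idist.get_rank()    # global rank of the current process
    device = idist.device()    # device assigned to the current process
    print(f"process {rank} is running on {device}")
    # ... build model/optimizer/dataloaders and run the trainer here ...


if __name__ == "__main__":
    # equivalent of passing --backend="nccl" --nproc_per_node=2 on the command line
    with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
        parallel.run(training, config={})
```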

##### Using [Horovod](https://horovod.readthedocs.io/en/latest/index.html) as distributed backend

Please make sure to have Horovod installed before running.

Let's start training on a single node with 2 GPUs:

```bash
# horovodrun
horovodrun -np 2 python -u main.py run --backend="horovod"
```

or

```bash
# using function spawn inside the code
python -u main.py run --backend="horovod" --nproc_per_node=2
```

#### Colab or Kaggle kernels, on 8 TPUs

```python
# setup TPU environment
import os
assert os.environ['COLAB_TPU_ADDR'], 'Make sure to select TPU from Edit > Notebook settings > Hardware accelerator'
```

```python
# install PyTorch/XLA in the notebook (the `!` lines are IPython shell escapes)
VERSION = "nightly"
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --version $VERSION > /dev/null
```

```python
from main import run
run(backend="xla-tpu", nproc_per_node=8)
```

## ClearML fileserver

If a `ClearML` server is used (i.e. the `--with_clearml` argument is passed), artifact upload must be configured by
modifying the `ClearML` configuration file `~/clearml.conf` generated by `clearml-init`. According to the
[documentation](https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html), the `output_uri` argument can be
set via `sdk.development.default_output_uri` to point to the fileserver URI. For a self-hosted server, the `ClearML`
fileserver URI is `http://localhost:8081`.
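
For example, a minimal fragment of `~/clearml.conf` pointing artifact uploads at a self-hosted fileserver could look like the following (the host and port must match your own deployment):

```
sdk {
    development {
        # upload artifacts/models to the self-hosted ClearML fileserver
        default_output_uri: "http://localhost:8081"
    }
}
```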

For more details, see https://allegro.ai/clearml/docs/docs/examples/reporting/artifacts.html