I had been using DataParallel for multi-GPU training, which suffers from load imbalance across GPUs. A senior labmate pointed me to the Accelerate library, and after some tinkering it solved both the multi-GPU load-balancing and the mixed-precision training problems. This post records the configuration details and the pitfalls I hit along the way.
Accelerate is a distributed training library for PyTorch that enables multi-GPU parallelism and mixed-precision training. Its advantage is simplicity: you only need to add a few lines to your existing code. Under the hood it is built on torch.distributed, but Accelerate wraps torch.distributed so that it is much more convenient to use. (This paragraph was generated by Copilot.)
1. Installation
conda install -c conda-forge accelerate
or
pip install accelerate
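As a quick sanity check (not part of the original post), you can verify the installation by importing the library and printing its version:
import accelerate
print(accelerate.__version__)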
2. Configuration
After installation, first run accelerate config and walk through the interactive CLI; this generates a default_config.yaml file under ~/.cache/huggingface/accelerate. The full configuration process looks like this:
❯ accelerate config
----------------------------In which compute environment are you running?
Please select a choice using the arrow or number keys, and selecting with enter
➔ This machine
AWS (Amazon SageMaker)
----------------------------Which type of machine are you using?
Please select a choice using the arrow or number keys, and selecting with enter
No distributed training
multi-CPU
multi-XPU
➔ multi-GPU
multi-NPU
TPU
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1 # number of machines to use
Should distributed operations be checked while running for errors? This can avoid timeout issues but will be slower. [yes/NO]: no
Do you wish to optimize your script with torch dynamo?[yes/NO]:no
Do you want to use DeepSpeed? [yes/NO]: yes
Do you want to specify a json file to a DeepSpeed config? [yes/NO]: no
----------------------------What should be your DeepSpeed's ZeRO optimization stage?
Please select a choice using the arrow or number keys, and selecting with enter
0
1
2
➔ 3
----------------------------Where to offload optimizer states?
Please select a choice using the arrow or number keys, and selecting with enter
none
➔ cpu
nvme
----------------------------Where to offload parameters?
Please select a choice using the arrow or number keys, and selecting with enter
none
➔ cpu
nvme
How many gradient accumulation steps you're passing in your script? [1]: 4 # gradient accumulation; just press Enter if you don't need it
Do you want to use gradient clipping? [yes/NO]: no # gradient clipping
Do you want to save 16-bit model weights when using ZeRO Stage-3? [yes/NO]: no
Do you want to enable `deepspeed.zero.Init` when using ZeRO Stage-3 for constructing massive models? [yes/NO]: no
How many GPU(s) should be used for distributed training? [1]:2 # number of GPUs to use
----------------------------Do you wish to use FP16 or BF16 (mixed precision)?
Please select a choice using the arrow or number keys, and selecting with enter
no # do not use mixed precision
➔ fp16 # fp16, typically for NVIDIA GPUs
bf16 # bf16, typically for TPUs
fp8
accelerate configuration saved at /home/variantconst/.cache/huggingface/accelerate/default_config.yaml
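For reference, with the choices above the generated default_config.yaml ends up looking roughly like the sketch below; this is reconstructed from memory and the exact fields vary with the Accelerate version, so treat it as an illustration rather than the literal file contents.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 3
machine_rank: 0
mixed_precision: fp16
num_machines: 1
num_processes: 2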
3. Usage
For this part, refer to the official Accelerate documentation; it has a very handy Quick Tour plus a collection of common code snippets covering the typical use cases. Here I only keep a brief record.
The official pitch is that you only need to add four lines to your existing PyTorch code. The main changes are: after defining the model and the other training objects, hand them to an accelerator instance to be wrapped via prepare, and call accelerator.backward instead of loss.backward for the backward pass.
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- device = 'cuda'
+ device = accelerator.device
model.to(device)
+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+ model, optimizer, training_dataloader, scheduler
+ )
for batch in training_dataloader:
optimizer.zero_grad()
inputs, targets = batch
- inputs = inputs.to(device) # no need to move data to the GPU manually
- targets = targets.to(device)
outputs = model(inputs)
loss = loss_function(outputs, targets)
+ accelerator.backward(loss) # backward pass is handled by the accelerator
optimizer.step()
scheduler.step()
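Put together, the loop above corresponds to something like the following minimal, runnable sketch; the toy linear model, random data, and hyperparameters are placeholders of my own, not from the original code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
device = accelerator.device

# placeholder model, optimizer, scheduler, loss, and data
model = nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)
loss_function = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
training_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)

# prepare() wraps everything for the current (multi-GPU / DeepSpeed) setup
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for epoch in range(3):
    for inputs, targets in training_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)                # batches are already on the right device
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)             # replaces loss.backward()
        optimizer.step()
        scheduler.step()
With the configuration from the previous section, a script like this is started with accelerate launch train.py rather than plain python train.py.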
In practice, though, you are bound to run into a few issues:
- Half-precision training
During half-precision training I hit an error about a type mismatch between Half and float. The fix is to wrap the loss computation in accelerator.autocast, which is used much like torch.cuda.amp.autocast.
with accelerator.autocast():
loss = loss_function(outputs, targets)
- Gradient accumulation
To use gradient accumulation, pass the number of accumulation steps when creating the accelerator, then wrap each batch's computation in accelerator.accumulate.
accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader = accelerator.prepare(model, optimizer, training_dataloader)
for input, label in training_dataloader:
with accelerator.accumulate(model):
predictions = model(input)
loss = loss_function(predictions, label)
accelerator.backward(loss)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
- Writing intermediate outputs during training
If you need to save images (or similar artifacts) during training, make sure this happens only on the main process; otherwise every process will try to write them and you may get duplicated or conflicting behavior.
if accelerator.is_local_main_process:
# Is executed once per server
Similarly, the tqdm progress bar should only be updated from the main process:
from tqdm.auto import tqdm
progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
If you just need to print some logs, you can simply switch to accelerator.print.
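As a small combined example (the epoch/loss variables and the save_debug_images helper are hypothetical), the main-process guard and accelerator.print fit together like this:
# printed once by the main process instead of once per GPU
accelerator.print(f"epoch {epoch}: loss = {loss.item():.4f}")
# write debug artifacts (e.g. sample images) only on the local main process
if accelerator.is_local_main_process:
    save_debug_images(outputs, f"samples_epoch_{epoch}.png")  # hypothetical helper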