These divisibility requirements come from the FP8 Tensor Cores. The simplest fix is to pad the dimensions to the nearest multiple of 32, but this workload also seems too small to achieve full GPU utilization. It may be better to disable FP8 for the small layers and avoid the extra overhead:
import torch
import transformer_engine.pytorch as te

# Construct model
layer1 = te.Linear(712, 896)
layer2 = te.Linear(896, 4096)
layer3 = te.Linear(4096, 4096)

# Forward pass: layer1 in FP32, layer2 and layer3 in FP8
x = torch.randn(4096, 712)
with te.fp8_autocast():
    with te.fp8_autocast(enabled=False):
        x = layer1(x)
    x = layer2(x)
    x = layer3(x)
loss = loss_fn(x)  # loss_fn defined elsewhere

# Backward pass
loss.backward()
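For reference, here is a minimal sketch of the padding alternative mentioned above (the rounding helper, the zero-padding of the input feature dimension via torch.nn.functional.pad, and the device placement are illustrative assumptions, not part of the original report):

import torch
import torch.nn.functional as F
import transformer_engine.pytorch as te

# Round the problematic feature dimension (712) up to the next multiple of 32 -> 736,
# which satisfies the FP8 GEMM shape requirements.
in_features = 712
padded_in_features = ((in_features + 31) // 32) * 32  # 736

# Build the layer with the padded input width; its weight is [896, 736] instead of [896, 712].
layer1 = te.Linear(padded_in_features, 896)

x = torch.randn(4096, in_features, device="cuda")
# Zero-pad the last (feature) dimension of the input to match the padded layer width.
x = F.pad(x, (0, padded_in_features - in_features))

with te.fp8_autocast():
    y = layer1(x)  # now runs in FP8 without the shape assertion

Because the padded input features are zero, the extra weight columns do not affect the output; whether padding or the FP32 fallback above is preferable depends on how much of the runtime these small GEMMs actually account for.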
AssertionError: FP8 execution requires 2D input matrices with height divisible by 8 and width divisible by 16, but got tensor with dims=[896, 712]