When learning with Transformer, loss becomes nan after backpropagation. #37
Comments
Applying attention masks with
import numpy as np
import torch

# SpeechTransformer is assumed to be imported from this project.
model = SpeechTransformer(
    num_classes=10, d_model=64, input_dim=80, d_ff=256,
    num_encoder_layers=3, num_decoder_layers=3)
inputs = torch.rand((32, 16, 80), dtype=torch.float)
targets = torch.randint(0, 10, (32, 16), dtype=torch.long)
lengths = torch.empty((32,), dtype=torch.long).fill_(80)
with torch.no_grad():
    predicted = model(inputs, lengths, targets)
print(np.isnan(predicted.numpy()).any())  # True if any output value is nan
Using
Implementations of the Transformer model usually choose constant (and bounded) values instead of unusual ones (e.g.
Sometimes you can see some with |
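As a rough illustration of why a bounded mask value can matter, the sketch below uses plain Python floats rather than the project's tensors. It assumes the failure mode is a softmax row in which every position is masked (for example a row made entirely of padding); the thread does not confirm that this is the cause here. The helper names `softmax` and `mask_scores` and the fill value `-1e9` are hypothetical choices for the example.

```python
import math

def softmax(scores):
    """Numerically stabilised softmax over a list of floats."""
    peak = max(scores)
    # If every score is -inf, then peak is -inf and (-inf) - (-inf) is nan.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mask_scores(scores, masked, fill_value):
    """Replace masked positions with fill_value before the softmax."""
    return [fill_value if m else s for s, m in zip(scores, masked)]

scores = [0.3, -1.2, 0.8]
fully_masked = [True, True, True]     # every position is masked
partly_masked = [False, True, False]  # only the middle position is masked

# Unbounded fill: the fully masked row turns into nan.
inf_row = softmax(mask_scores(scores, fully_masked, float("-inf")))
# Bounded fill: the fully masked row becomes a uniform distribution.
finite_row = softmax(mask_scores(scores, fully_masked, -1e9))
# With at least one unmasked position, -inf is harmless.
partial_row = softmax(mask_scores(scores, partly_masked, float("-inf")))

print(inf_row, finite_row, partial_row)
```

A row with at least one unmasked position is fine with either fill value, which is consistent with other code bases working with -np.inf as long as no row is fully masked.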
I never thought there was a bug in that part! Thank you, I'll try it out! |
After experimenting, the loss becomes nan again. That problem did exist, but there seems to be another one as well. |
Also, if you refer to this repo, it works normally with -np.inf, so that part probably needs further checking. |
Let's check why this repository works well with
import numpy as np
import torch

pred = np.random.rand(64, 32, 1024)
pred = np.where(pred < 0.999999, pred, np.nan)  # entries >= 0.999999 become nan
pred = torch.tensor(pred, dtype=torch.float)
target = np.random.randint(0, 1024, (64, 32), dtype=np.int64)
# Set the target to 0 (the ignore_index) wherever the prediction row contains nan.
target = np.where(torch.isnan(pred).any(dim=-1).numpy(), 0, target)
target = torch.tensor(target, dtype=torch.long)
loss = LabelSmoothedCrossEntropyLoss(
    num_classes=1024,
    ignore_index=0,
    smoothing=0.1,
    architecture='transformer',
    reduction='mean')
print(loss(pred.view(-1, pred.size(-1)), target.view(-1)))
Output:
print(cal_loss(pred.view(-1, pred.size(2)), target.view(-1),
               smoothing=0.1))
Output:
Why does this happen? Actually, they both work well without label smoothing. The problem is in how the loss tensor is reduced.
# ...
with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    label_smoothed[target == self.ignore_index, :] = 0
return self.reduction_method(-label_smoothed * logit)
# ...

# ...
non_pad_mask = gold.ne(IGNORE_ID)
n_word = non_pad_mask.sum().item()
loss = -(one_hot * log_prb).sum(dim=1)
loss = loss.masked_select(non_pad_mask).sum() / n_word
# ...
While your code reduces the smoothed logits, So if you want to use
with torch.no_grad():
    label_smoothed = torch.zeros_like(logit)
    label_smoothed.fill_(self.smoothing / (self.num_classes - 1))
    label_smoothed.scatter_(1, target.data.unsqueeze(1), self.confidence)
    # label_smoothed[target == self.ignore_index, :] = 0
score = (-label_smoothed * logit).sum(1)
score = score.masked_select(target != self.ignore_index)
return self.reduction_method(score)
Output: |
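The difference between the two reductions can be reproduced without any tensors. The sketch below is a simplified plain-Python stand-in, not either repository's implementation: it assumes per-token log-probabilities stored as nested lists, a hypothetical `IGNORE_INDEX` of 0, and hypothetical helper names. Zeroing the weights of an ignored row does not help when that row holds nan, because 0 * nan is still nan; dropping the row before summing keeps the result finite.

```python
import math

NAN = float("nan")
IGNORE_INDEX = 0  # hypothetical padding id, matching ignore_index=0 above

def smoothed_weights(target, num_classes, smoothing):
    """Label-smoothed target distribution for one token."""
    off_value = smoothing / (num_classes - 1)
    return [1.0 - smoothing if c == target else off_value
            for c in range(num_classes)]

def loss_zeroed_rows(log_probs, targets, smoothing=0.1):
    """Zero the weights of ignored rows, then sum over every element."""
    total = 0.0
    for row, target in zip(log_probs, targets):
        weights = smoothed_weights(target, len(row), smoothing)
        if target == IGNORE_INDEX:
            weights = [0.0] * len(row)
        # 0.0 * nan is nan, so an ignored nan row still poisons the sum.
        total += sum(-w * lp for w, lp in zip(weights, row))
    return total

def loss_masked_rows(log_probs, targets, smoothing=0.1):
    """Compute per-row losses, drop ignored rows, then average the rest."""
    kept = []
    for row, target in zip(log_probs, targets):
        if target == IGNORE_INDEX:
            continue  # the nan row never enters the reduction
        weights = smoothed_weights(target, len(row), smoothing)
        kept.append(sum(-w * lp for w, lp in zip(weights, row)))
    return sum(kept) / len(kept)

log_probs = [[-1.2, -0.4, -2.3],  # a real token
             [NAN, NAN, NAN]]     # a padded position whose values are nan
targets = [1, IGNORE_INDEX]

zeroed = loss_zeroed_rows(log_probs, targets)
masked = loss_masked_rows(log_probs, targets)
print(zeroed, masked)
```

Selecting the non-padding rows before the reduction, as the masked_select version does, is what keeps nan values at ignored positions out of the final loss.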
Oh, thanks for letting me know. |
I've never seen |
Never mind. |
No. Basically, a Transformer model with post-LN needs learning rate warm-up. You need to take that into account. I don't have any dataset for this project, so I cannot test your code accurately. When does the loss diverge? Can you show me the training logs in detail? |
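For reference, one widely used warm-up rule is the inverse-square-root schedule from the original Transformer paper. The sketch below is only an illustration of that formula: the function name `warmup_lr` is hypothetical, `d_model=64` is taken from the example earlier in this thread, and `warmup_steps=4000` is an assumed default, not a value tuned for this project.

```python
def warmup_lr(step, d_model=64, warmup_steps=4000):
    """Learning rate that grows linearly during warm-up, then decays as step**-0.5."""
    if step < 1:
        raise ValueError("step must start at 1")
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate rises until warmup_steps and falls afterwards.
for step in (1, 1000, 4000, 16000):
    print(step, warmup_lr(step))
```

The value returned for each step would be assigned to the optimizer's learning rate before the update, so the first updates are small while the post-LN layers are still unstable.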
Can you come to gitter and talk to me in real time? |
Confirmed that there is a problem with the masking => debugging in progress |
Currently two models are implemented, Seq2seq and Transformer. When training with the Transformer, the loss keeps becoming nan after backpropagation. I have tried debugging, but I have not yet confirmed which part is wrong. If you have had a similar experience or have any guesses, I would appreciate it if you could help me.