Added Recurrent Batch Normalization #163
base: master
Conversation
Thanks! Curious - have you tested if this works better?
I had the same question, and I just deployed it to our servers. I'll come back with more results!
Hey @iassael, did you have different mean/variance for each timestep? Or a shared mean/variance over all timesteps of one batch? The paper says: "Consequently, we recommend using separate statistics for each timestep to preserve information of the initial transient phase in the activations."
UPDATE: see my reply below. Hi @windweller, you are right; in this case, following the current project structure, the statistics were computed over all timesteps.
@windweller, looking at the implementation again, the batch-normalization running statistics are not trainable parameters and are therefore not tied when parameters are shared. So even when the proto.rnn is cloned, each clone keeps its own mean/variance statistics. Hence, the implementation is acting as recommended in the paper. Thank you for pointing it out!
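As a rough illustration of why this holds (assuming the standard torch/nn package, not necessarily the PR's exact code): the running statistics of nn.BatchNormalization are buffers rather than trainable parameters, so sharing parameters across unrolled clones leaves each clone's statistics independent.

```lua
-- Rough illustration, assuming the standard torch/nn package:
-- :parameters() exposes only the learnable gain and bias, so flattening and
-- sharing parameters across unrolled clones does not share the running
-- mean/variance buffers; each clone keeps its own statistics.
require 'nn'

local bn = nn.BatchNormalization(4)
local params = bn:parameters()
print(#params)  -- 2: weight (gamma) and bias (beta); running stats are not listed
```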
Quick note: there is no need to implement
Can I ask what the motivation is for removing biases from that linear layer? (I haven't read the BN LSTM papers yet.) Is this just to avoid redundancy? Also, is it a big deal if this isn't done? Also, is this code fully backward compatible and identical in functionality? And how would the code behave if someone has an older version of Torch that does not have the LinearNB patch? EDIT: e.g. it seems to me that due to the additional
Hi @karpathy, the motivation is exactly to avoid redundancy: it saves 2*rnn_size parameters. With the default settings that is 256 of the model's 239,297 parameters (~0.1%), which is not significant and could be ignored. In terms of backward compatibility, a redundant argument passed to a function in Lua is simply ignored. So, although the layer would behave slightly differently, it should still maintain backward compatibility and work in both cases. A simple example is the following:
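A minimal sketch of that Lua behaviour (the layer constructor name here is made up for illustration):

```lua
-- Minimal sketch: Lua discards arguments a function does not declare, so
-- passing an extra flag to an older constructor signature does not error.
local function makeLayer(inputSize, outputSize)
  return { inputSize = inputSize, outputSize = outputSize }
end

local a = makeLayer(128, 512)         -- usual call
local b = makeLayer(128, 512, false)  -- the extra 'false' is silently ignored
print(a.outputSize, b.outputSize)     -- 512   512
```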
Following Recurrent Batch Normalization (http://arxiv.org/abs/1603.09025), this code implements batch-normalized LSTMs.
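For orientation, a rough, simplified sketch of one batch-normalized LSTM step in the nngraph style used by this project might look like the following. It is not the PR's exact code: plain nn.Linear stands in for the bias-free LinearNB layer discussed above, and initialization details from the paper are omitted.

```lua
-- Rough sketch of one batch-normalized LSTM step (arXiv:1603.09025).
-- Illustration only: plain nn.Linear is used where the PR drops the bias.
require 'nn'
require 'nngraph'

local function bnlstm(input_size, rnn_size)
  local x      = nn.Identity()()
  local prev_c = nn.Identity()()
  local prev_h = nn.Identity()()

  -- input-to-hidden and hidden-to-hidden pre-activations for all four gates
  local i2h = nn.Linear(input_size, 4 * rnn_size)(x)
  local h2h = nn.Linear(rnn_size, 4 * rnn_size)(prev_h)

  -- batch-normalize each path separately before summing, as in the paper
  local all_sums = nn.CAddTable()({
    nn.BatchNormalization(4 * rnn_size)(i2h),
    nn.BatchNormalization(4 * rnn_size)(h2h),
  })

  -- split the pre-activations into the four gates
  local reshaped = nn.Reshape(4, rnn_size)(all_sums)
  local n1, n2, n3, n4 = nn.SplitTable(2)(reshaped):split(4)

  local in_gate      = nn.Sigmoid()(n1)
  local forget_gate  = nn.Sigmoid()(n2)
  local out_gate     = nn.Sigmoid()(n3)
  local in_transform = nn.Tanh()(n4)

  local next_c = nn.CAddTable()({
    nn.CMulTable()({forget_gate, prev_c}),
    nn.CMulTable()({in_gate, in_transform}),
  })

  -- the new cell state is also batch-normalized before the output nonlinearity
  local next_h = nn.CMulTable()({
    out_gate,
    nn.Tanh()(nn.BatchNormalization(rnn_size)(next_c)),
  })

  return nn.gModule({x, prev_c, prev_h}, {next_c, next_h})
end
```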