11 implement moe #116
base: main
Conversation
Great work! Well done! :)
For me, the tests pass locally, but GitHub Actions seems to have a problem installing modalities for the tests.
When running pre-commit run --all-files locally, two files get altered. Please fix this.
A clarification question:
In this initial MoE implementation, we could face inefficient training where only a few lucky experts get trained intensively, since no auxiliary loss, expert capacity, or similar technique is implemented yet, right?
E.g. as mentioned here: https://huggingface.co/blog/moe#load-balancing-tokens-for-moes
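For reference, a minimal sketch of a Switch-Transformer-style auxiliary load-balancing loss that could address this; the names and shapes are illustrative and not taken from this PR:

import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor, top_experts: torch.LongTensor, num_experts: int) -> torch.Tensor:
    """Switch-Transformer-style auxiliary loss: num_experts * sum_e f_e * P_e,
    where f_e is the fraction of tokens routed to expert e and P_e is the
    mean router probability assigned to expert e."""
    # Fraction of (token, slot) assignments dispatched to each expert (hard top-k assignment).
    expert_mask = F.one_hot(top_experts.reshape(-1), num_classes=num_experts).float()
    tokens_per_expert = expert_mask.mean(dim=0)  # f_e
    # Mean softmax probability mass the router puts on each expert.
    router_prob_per_expert = router_probs.mean(dim=0)  # P_e
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)

Adding this term (scaled by a small coefficient) to the language-modeling loss pushes the router toward spreading tokens more evenly across experts.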
# if poe_type is not PositionTypes.NOPE and RotaryTransform in [
#     config.type_hint.value for config in attention_config.qkv_transforms
# ]:
#     raise ValueError('It is expected to use "RotaryTransform" together with "NOPE".')
Can this be deleted?
We had commented out this part during development; I re-added the check in the code.
self.layer = nn.Linear(self.hidden_size, self.moe_num_experts, bias=False)

def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.LongTensor]:
    if self.training and self.moe_jitter_eps is not None:
Would it be possible to add a small section in the README, or here as part of the docs, explaining what the different flags do and what their purpose is? I.e. jitter, normalization of expert weights, and uniform expert assignment.
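For illustration, a rough sketch of what these three flags typically control in routers of this kind; this is an assumption based on common MoE router implementations (e.g. DBRX's), not a description of this PR's code:

import torch

# jitter: multiplicative noise on the router input, applied only during training,
# to discourage the router from collapsing onto a few experts.
def apply_jitter(x: torch.Tensor, moe_jitter_eps: float) -> torch.Tensor:
    return x * torch.empty_like(x).uniform_(1.0 - moe_jitter_eps, 1.0 + moe_jitter_eps)

# normalization of expert weights: rescale each token's top_k routing weights by their
# p-norm, so that (for p=1) the kept weights form a proper convex combination.
def normalize_top_weights(top_weights: torch.Tensor, p: float) -> torch.Tensor:
    return top_weights / top_weights.norm(p=p, dim=-1, keepdim=True)

# uniform expert assignment: debugging override that ignores the router and assigns
# experts round-robin, so every expert sees roughly the same number of tokens.
def uniform_assignment(top_experts: torch.LongTensor, num_experts: int) -> torch.LongTensor:
    return torch.arange(top_experts.numel(), device=top_experts.device).remainder(num_experts).view_as(top_experts)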
x.view(-1, x.shape[-1])
is the information basis for the routing. What does it translate to? It is not an aggregate of embeddings, but seems only to merge away the first dimension. Since the first dimension is usually the batch size, we get a resulting matrix of size [batch_size * context_length, hidden_size]. We do this because we must select top_k experts for each token, so that each expert ends up processing a bunch of contextualized embeddings. One could imagine it as a "gap text" for each expert, with different, potentially overlapping gaps and non-gaps between the gap texts.
I think so too, and I really liked the "gap text" analogy/explanation, a nice way of visualizing what's going on. Plus, I think the first dimension is more being combined with the sequence length (which sounds more accurate than "getting rid of", I guess) to create a batch of sequence elements, each of which is then independently routed to the experts.
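A small sketch of the routing view being discussed; the router layer, top_k, and the concrete shapes are illustrative assumptions following the snippet quoted above:

import torch

batch_size, context_length, hidden_size, num_experts, top_k = 2, 4, 8, 4, 2
x = torch.randn(batch_size, context_length, hidden_size)

# Merge batch and sequence dimensions: every token becomes an independent routing unit.
flat_x = x.view(-1, x.shape[-1])  # [batch_size * context_length, hidden_size]

router = torch.nn.Linear(hidden_size, num_experts, bias=False)
weights = torch.softmax(router(flat_x), dim=-1)          # [num_tokens, num_experts]
top_weights, top_experts = weights.topk(top_k, dim=-1)   # top_k experts per token
print(flat_x.shape, top_weights.shape, top_experts.shape)
# torch.Size([8, 8]) torch.Size([8, 2]) torch.Size([8, 2])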
The code was very readable, especially with the tests you added. We left comments regarding implementation changes.
Besides that, MoEExperts.forward() is still unclear to us. It would be nice to go through the code there together ;-)
)

def forward(self, x: torch.Tensor, top_weights: torch.Tensor, top_experts: torch.LongTensor) -> torch.Tensor:
    bsz, q_len, hidden_size = x.shape
Suggested change:
- bsz, q_len, hidden_size = x.shape
+ batch_size, sequence_length, hidden_size = x.shape
    hidden_size=hidden_size, ffn_hidden_size=ffn_hidden_size, moe_num_experts=moe_num_experts, act_fn=act_fn
)

def forward(self, x: torch.Tensor, top_weights: torch.Tensor, top_experts: torch.LongTensor) -> torch.Tensor:
It would be nice to go through this together. We were just debugging but did not get the full picture of what happens here.
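In the meantime, a rough sketch of what a forward with this signature typically does, assuming a simple loop over experts; the expert modules and all names below are illustrative, not the PR's actual implementation:

import torch


def dispatch_to_experts(
    x: torch.Tensor,                # [batch_size, sequence_length, hidden_size]
    top_weights: torch.Tensor,      # [num_tokens, top_k] routing weights
    top_experts: torch.LongTensor,  # [num_tokens, top_k] expert indices
    experts: torch.nn.ModuleList,   # one MLP per expert (illustrative)
) -> torch.Tensor:
    batch_size, sequence_length, hidden_size = x.shape
    flat_x = x.view(-1, hidden_size)  # [num_tokens, hidden_size]
    out = torch.zeros_like(flat_x)
    for expert_idx, expert in enumerate(experts):
        # Which (token, slot) pairs were routed to this expert?
        token_idx, slot_idx = torch.where(top_experts == expert_idx)
        if token_idx.numel() == 0:
            continue
        # Run only those tokens through the expert, weight the result, and
        # accumulate it back at the original token positions.
        expert_out = expert(flat_x[token_idx])
        out.index_add_(0, token_idx, expert_out * top_weights[token_idx, slot_idx].unsqueeze(-1))
    return out.view(batch_size, sequence_length, hidden_size)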
):
    super().__init__()
    self.moe_num_experts = moe_num_experts
    self.mlp = MoEExpertGLU(
Is mlp a good variable name here? It is also used in gpt2_model.py for MoEFFN.
Totally agreed, that name is probably not the best. We just wanted to be consistent with the names already used, as you can see here:
self.mlp = TransformerMLP(n_embd=n_embd, ffn_hidden=ffn_hidden, bias=bias, dropout=dropout)
…ling for NOPE and RotaryTransform to work even when attention_config is not passed
…nt config files for MoE inference run
This has been reviewed as a peer review. Nice work!