
11 implement moe #116 (Draft)

wants to merge 37 commits into base: main
Conversation

AbasKhan (Collaborator)

No description provided.

@AbasKhan AbasKhan requested a review from le1nux April 29, 2024 09:43
@AbasKhan AbasKhan self-assigned this Apr 29, 2024
@AbasKhan AbasKhan linked an issue Apr 29, 2024 that may be closed by this pull request
lllAlexanderlll (Contributor) left a comment

Great work! Well done! :)
The tests run through locally for me, but GitHub Actions seems to have a problem installing modalities for the tests.
When running `pre-commit run --all-files` locally, two files are altered; please fix this.

A clarification question:
In this initial MoE implementation, could we face inefficient training where only a few lucky experts get trained intensively, since no auxiliary loss, expert capacity, or similar technique is implemented yet?
See, e.g., https://huggingface.co/blog/moe#load-balancing-tokens-for-moes
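
(For reference, a minimal sketch of the Switch-Transformer-style load-balancing auxiliary loss described in the linked blog post; the helper name and signature are hypothetical and not part of this PR:)

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, top_experts: torch.LongTensor, num_experts: int) -> torch.Tensor:
    # router_logits: [num_tokens, num_experts]; top_experts: [num_tokens, top_k]
    probs = F.softmax(router_logits, dim=-1)
    mean_prob_per_expert = probs.mean(dim=0)  # P_e: average router probability per expert
    # f_e: fraction of routing slots actually assigned to each expert
    assignments = F.one_hot(top_experts.reshape(-1), num_experts).float()
    tokens_per_expert = assignments.mean(dim=0)
    # minimized when both distributions are uniform over the experts
    return num_experts * torch.sum(mean_prob_per_expert * tokens_per_expert)
```

Such a term would typically be scaled by a small coefficient and added to the language-modeling loss.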

Comment on lines 474 to 477

# if poe_type is not PositionTypes.NOPE and RotaryTransform in [
#     config.type_hint.value for config in attention_config.qkv_transforms
# ]:
#     raise ValueError('It is expected to use "RotaryTransform" together with "NOPE".')
Contributor:

Can this be deleted?

AbasKhan (Collaborator, Author) · May 16, 2024:

We had commented out this part during development; I have re-added the check in the code.

        self.layer = nn.Linear(self.hidden_size, self.moe_num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.LongTensor]:
        if self.training and self.moe_jitter_eps is not None:
lllAlexanderlll (Contributor) · May 6, 2024:

Would it be possible to add a small section in the README, or here as part of the docs, explaining what the different flags do and what their purpose is? I.e. jitter, normalization of expert weights, and uniform expert assignment.

x.view(-1, x.shape[-1]) is the information basis for the routing. What does it translate to? It is not an aggregate of embeddings; it only merges away the first dimension. As this is usually the batch_size dimension, we get a resulting matrix of size [batch_size * context_length, hidden_size]. We do this because we must select top_k experts for each token, so that each expert ends up processing a subset of contextualized embeddings. One could imagine it as a "gap text" for each expert, with different, potentially overlapping gaps, and non-gaps between the gap texts.
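
(To illustrate both points above, the flags and the flattening, here is a minimal, hypothetical router sketch; the names loosely follow the snippets quoted in this PR but it is not the actual implementation:)

```python
from typing import Optional, Tuple

import torch
import torch.nn as nn


class SketchRouter(nn.Module):
    def __init__(self, hidden_size: int, num_experts: int, top_k: int,
                 jitter_eps: Optional[float] = None,
                 normalize_expert_weights: bool = False,
                 uniform_expert_assignment: bool = False):
        super().__init__()
        self.layer = nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts = num_experts
        self.top_k = top_k
        self.jitter_eps = jitter_eps
        self.normalize_expert_weights = normalize_expert_weights
        self.uniform_expert_assignment = uniform_expert_assignment

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.LongTensor]:
        if self.training and self.jitter_eps is not None:
            # jitter: multiplicative noise on the inputs to regularize the routing decision
            x = x * torch.empty_like(x).uniform_(1.0 - self.jitter_eps, 1.0 + self.jitter_eps)

        # [batch_size, context_length, hidden_size] -> [batch_size * context_length, hidden_size]:
        # every token becomes an independent routing decision.
        flat = x.view(-1, x.shape[-1])
        weights = self.layer(flat).softmax(dim=-1)                  # [num_tokens, num_experts]
        top_weights, top_experts = weights.topk(self.top_k, dim=-1)

        if self.normalize_expert_weights:
            # renormalize the selected top_k weights so they sum to 1 per token
            top_weights = top_weights / top_weights.sum(dim=-1, keepdim=True)

        if self.uniform_expert_assignment:
            # debugging/ablation aid: ignore the learned routing, assign experts round-robin
            flat_assignment = torch.arange(top_experts.numel(), device=top_experts.device) % self.num_experts
            top_experts = flat_assignment.view_as(top_experts)

        return top_weights, top_experts
```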

AbasKhan (Collaborator, Author):

I think so too; I really liked the "gap text" analogy, a nice way of visualizing what is going on. Also, I think the first dimension is better described as being combined with the sequence length (which sounds more accurate than "getting rid of" it) to create a batch of sequence elements, each of which is routed to the experts independently.

mrudat-iais (Collaborator) left a comment

The code was very readable, especially with the tests you added. We left comments regarding implementation changes.
Besides that, MoEExperts.forward() is still unclear to us. It would be nice to go through the code there together ;-)

Resolved review threads: src/modalities/nn/moe.py (×2), tests/nn/test_moe.py
        )

    def forward(self, x: torch.Tensor, top_weights: torch.Tensor, top_experts: torch.LongTensor) -> torch.Tensor:
        bsz, q_len, hidden_size = x.shape
Collaborator:
Suggested change:
-        bsz, q_len, hidden_size = x.shape
+        batch_size, sequence_length, hidden_size = x.shape

            hidden_size=hidden_size, ffn_hidden_size=ffn_hidden_size, moe_num_experts=moe_num_experts, act_fn=act_fn
        )

    def forward(self, x: torch.Tensor, top_weights: torch.Tensor, top_experts: torch.LongTensor) -> torch.Tensor:
Collaborator:
It would be nice to go through this together. We were just debugging but did not get the full picture of what happens here.
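
(For future readers, a minimal sketch of how a dispatch-and-combine experts forward typically works; the class and layer layout are hypothetical and not the PR's actual MoEExperts implementation:)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchExperts(nn.Module):
    def __init__(self, hidden_size: int, ffn_hidden_size: int, num_experts: int):
        super().__init__()
        self.num_experts = num_experts
        # one small feed-forward network per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_hidden_size), nn.GELU(), nn.Linear(ffn_hidden_size, hidden_size))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor, top_weights: torch.Tensor, top_experts: torch.LongTensor) -> torch.Tensor:
        batch_size, sequence_length, hidden_size = x.shape
        flat_x = x.view(-1, hidden_size)                      # [num_tokens, hidden_size]
        out = torch.zeros_like(flat_x)
        # expert_mask[e, k, t] is True if token t chose expert e in its k-th top_k slot
        expert_mask = F.one_hot(top_experts, self.num_experts).permute(2, 1, 0).bool()
        for expert_idx in range(self.num_experts):
            slot_idx, token_idx = torch.where(expert_mask[expert_idx])
            if token_idx.numel() == 0:
                continue
            # run only the tokens routed to this expert, then scale by their routing weight
            expert_out = self.experts[expert_idx](flat_x[token_idx])
            out.index_add_(0, token_idx, expert_out * top_weights[token_idx, slot_idx].unsqueeze(-1))
        return out.view(batch_size, sequence_length, hidden_size)
```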

    ):
        super().__init__()
        self.moe_num_experts = moe_num_experts
        self.mlp = MoEExpertGLU(
Collaborator:

Is mlp a good variable name here? It is also used in gpt2_model.py for MoEFFN.

AbasKhan (Collaborator, Author):

Totally agreed, the name is probably not the best. We just wanted to be consistent with the names already used, as you can see here:

    self.mlp = TransformerMLP(n_embd=n_embd, ffn_hidden=ffn_hidden, bias=bias, dropout=dropout)

@AbasKhan AbasKhan requested a review from mrudat-iais June 4, 2024 11:03
mrudat-iais (Collaborator) left a comment

This has been reviewed as a peer review. Nice work!

Labels: none yet
Projects: none yet
Successfully merging this pull request may close these issues: Implement MoE
7 participants