Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support megatron 0.6 in veRL #85

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Chendong98
Copy link

I am opening this PR with the hope of adding veRL support to Megatron 0.6 (although I noticed that the veRL paper seems to have already used Megatron 0.6 as the test version). From my naive perspective, I envision two possible approaches:

  1. Communication at the parameter level.
  2. Creating a MemoryBuffer in veRL that is fully aligned with the ParamAndGradBuffer in Megatron 0.6, and then performing broadcast and other communication operations based on this buffer.

communication-demo

In the current draft, when self._pp_rank == pp_rank, it directly uses the buffer defined in Megatron 0.6 (without even checking if use_distributed_optimizer is set), and communicates at the parameter level during parameter synchronization—this, of course, incurs some performance overhead.

At the very least, this approach seems feasible.

Signed-off-by: chendong-1998 <[email protected]>
Signed-off-by: chendong-1998 <[email protected]>
@PeterSH6
Copy link
Collaborator

PeterSH6 commented Jan 8, 2025

It looks really nice, we'll take some time to check how to align the two buffers to accelerate the resharding process.

@PeterSH6
Copy link
Collaborator

PeterSH6 commented Jan 8, 2025

Another question is whether there are any issues in MCore 0.6?
If not, we may not need to patch the upstream megatron anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants