-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
integrate with Lightning ecosystem CI #3502
Comments
in fact, we have already added it but it would be nice if you could verify the testing we did... |
Hey @Borda Thanks for the update ! I reviewed the Nemo config file in that pr, and mostly it seems great. The one thing popped up was the pytorch version for the template.md file - Nemo depends on the latest pytorch version almost always. So pt 1.10 is a requirement (especially for stuff like Megatron). In general, Nemo depends on PTL to support the latest public pytorch version. Furthermore, Nemo tests it's own CI on a PT container from NGC (versions as YY.MM) and these are generally available on other cloud providers too. Note that - usually once pytorch has a public release, it takes at least one month for the next container to actually have the public pytorch release. By actually have the released pytorch, I mean that the current container will have an alpha release of pytorch with some cherry picked changes vs the actual full new release in public. This can cause cases where improper version checking (using distutils instead of packaging.version.Version) can fail these alpha version comparison tests and cause PTL inside of the container to pick incorrect code paths. So the ecosystem CI will work fine ... but when you run it on a pytorch container released from Nvidia (ie on most cloud providers) it may fail (and not just Nemo, anything that uses PTL and hits that code path). So maybe on a seperate test prior to release, test the ecosystem CI on the latest public NGC pytorch container. Ofc this is a big task so it's just a suggestion. |
One thing - Nemo has a monthly release cycle - and generally releases at the last day of the month (or the closest working day). There are exceptions but that's generally the case. Around halfway the month, we branch of our main branch and have a release branch with the pattern "r1.x.y". X usually goes up 1-2-3 etc and Y is generally 0 - currently we will release r1.6.0. So if you can test against both main and the latest release branch, if that is possible to do on the ecosystem CI, then it would be safer. Otherwise main only is also fine since we routinely merge back the release brand into main anyway and at the end of the release too. |
On the previous discussion of Numba support - would be beneficial if a conda env with Numba 0.54/0.55 were available to test a few things in Nemo such as RNNT losses. However, it is not a necessity as Nemo itself tests it, and PTL has a very low probability of breaking it. It's breakage occurs mainly due to Pytorch or CUDA incompatibilities against Numba itself (the containers used in Nemo's CI have the latest CUDA and we do test RNNT there). |
this is fine, you can define any PT version in the config file
Yes, we have tests for let's an mostly also for future releasses
this is a good suggestion, maybe open a feature request on the EcoCI side to enable custom docker images...
this shall be fine as you can use custom checkout which can be tag or branch
sure, you just add two config files for each
that would be good feature request on EcoCI side :] |
Sounds good, I'll try setting those up on the ecosystem repo over the next week or two. |
Sorry for delay, major PRs needed to be merged for release. Added two feature requests for numba and CI test against container, Lightning-AI/ecosystem-ci#38 and Lightning-AI/ecosystem-ci#39. The configs for nemo seem correct but there's a newer version now, how should we update it? Submit a PR? https://github.com/PyTorchLightning/ecosystem-ci/blob/main/configs/NVIDIA/nemo-release.yaml |
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
This issue was closed because it has been inactive for 7 days since being marked as stale. |
Hello and so happy to see you use Pytorch-Lightning! 🎉
Just wondering if you already heard about quite the new Pytorch Lightning (PL) ecosystem CI where we would like to invite you to... You can check out our blog post about it: Stay Ahead of Breaking Changes with the New Lightning Ecosystem CI ⚡
As you use PL framework for your cool project, we would like to enhance your experience and offer you safe updates to our future releases. At this moment, you run tests with a particular PL version, but it may accidentally happen that the next version will be incompatible with your project... 😕 We do not intend to change anything on our project side, but still here we have a solution - ecosystem CI with testing both - your and our latest development head we can find it very early and prevent releasing eventually bad version... 👍
What is needed to do?
What will you get?
The text was updated successfully, but these errors were encountered: