DeepSpeed v0.10.0
New features
- ZeRO++: A leap in speed for LLM and chat model training with 4X less communication [English] [中文] [日本語]
- H100 support and testing with FP8 using NVIDIA's TransformerEngine
What's Changed
- Documentation for DeepSpeed Accelerator Abstraction Interface by @delock in #3184
- FP8 unittest for H100 by @jomayeri in #3731
- Fix apex install bugs by @loadams in #3741
- Fix Autotuner get_gas_from_user_config by @straywarrior in #3664
- Include cublas error details when getting cublas handle fails by @jli in #3695
- fix hybrid engine mlp module by @tensor-tang in #3736
- Fix output transpose dimension bugs by @loadams in #3747
- remove UtilsBuilder load, use torch (un)flatten ops by @inkcherry in #3728
- add Chinese Zhihu social account by @conglongli in #3755
- Account for expert parameters when calculating the total number of pa… by @alito in #3720
- fix ccl_backend and residual_add problems by @dc3671 in #3642
- Fix url in getting-started guide (docs) by @acforvs in #3768
- Update deepspeed-chat/japanese/README.md by @eltociear in #3765
- Add H100 workflow and status badge. by @loadams in #3754
- Add an api in deepspeed engine for adjusting micro batch size during training by @kisseternity in #3773
- Prevent hangs in CI during parallel run compilation by @mrwyattii in #2844
- Revert "Prevent hangs in CI during parallel run compilation" by @jeffra in #3817
- [Docs] chrome://tracing is deprecated by @keyboardAnt in #3805
- Support model declaration in zero.Init context by @tohtana in #3592
- Update zeropp.md by @samadejacobs in #3821
- Reduce Unit Test Times (Part 1) by @mrwyattii in #3829
- Re-enable GPT-J unit tests and refactor inference tests by @mrwyattii in #3618
- Fix racing condition in GatheredParameters by @HeyangQin in #3819
- zero/mics.py: use on_accelerator instead of cuda only by @guoyejun in #3806
- Disable AMD test flows in YML by @loadams in #3847
- Reduce Unit Test Time (Part 2) by @mrwyattii in #3838
- [profiling]add show_straggler argument to log_summary() by @delock in #3579
- checking process_group before merging bucket ranges (#3521) by @clumsy in #3577
- scripts/check-torchcuda.py: add checking for tensor.is_cuda by @guoyejun in #3843
- Zero3 Fix allreduce optimization for extra large tensor by @hablb in #3832
- [zero] revert PR #3166, it disabled grad clip for bf16 by @jeffra in #3790
- Fix transpose convolution FLOPS profiler (retrieval of out_channels) by @pinstripe-potoroo in #3834
- Fix LoRA Fuse/Unfuse in Hybrid Engine by @sxjscience in #3563
- Update pytorch-lightning version in CI by @mrwyattii in #3882
- [Docs] MMEngine has integrated deepspeed. by @HAOCHENYE in #3879
- Add FALCON Auto-TP Support by @RezaYazdaniAminabadi in #3640
- Update apex installation to resolve apex's pyproject.toml issues. by @loadams in #3745
- Extend HE-Lora test with Z3 support + Fix/add guard in HE for Z3 by @awan-10 in #3883
- Separate ZeRO3 InflightParamRegistry for train and eval by @HeyangQin in #3884
- Add GPTNeoX AutoTP support by @Yejing-Lai in #3778
- Fix Meta Tensor checkpoint load for BLOOM models by @lekurile in #3885
- fix error :Dictionary expression not allowed in type annotation Pylance by @digger-yu in #3708
- Fix rnn flop profiler to compute flops instead of macs by @pinstripe-potoroo in #3833
- Update workflows for merge queue by @mrwyattii in #3892
- Avoid deprecation warnings in CHECK_CUDA by @Flamefire in #3854
- Silence comm.py warning by @mrwyattii in #3893
- Fix a typo of global variable in comm.py by @hipudding in #3852
- [ROCm] Enable TestCUDABackward::test_backward unit tests by @rraminen in #3849
- [profiling][mics]Fix some issues for log_summary(). by @ys950902 in #3899
- fix "undefined symbol: curandCreateGenerator" for quantizer op by @jinzhen-lin in #3846
- fix memory leak with zero-3 by @jeffra in #3903
- fix some typo docs/ by @digger-yu in #3917
- fix: change ==NONE to is under deepspeed/ by @digger-yu in #3923
- Del comment deepspeed.zero.Init() can be used as a decorator by @hipudding in #3894
- Remove the param.ds_tensor from print by @HeyangQin in #3928
- Reduce Unit Test Times (Part 3) by @mrwyattii in #3850
- Update zero_to_fp32.py - to support deepspeed_stage_1 by @PicoCreator in #3936
- [docs] add xTrimoPGLM by @jeffra in #3940
- Update Nvidia docker base image by @KaiChen1008 in #3930
- Fix inference tutorial docs for checkpoints by @loadams in #3955
- fix Megatron-DeepSpeed links by @conglongli in #3956
- skip bcast when enable pp but pp_group_size=1 by @inkcherry in #3915
- Use device_name instead of device index to support other device by @hipudding in #3933
- Create accelerator for apple silicon GPU Acceleration by @NripeshN in #3907
- fix(cpu_accelerator): 🐛 Convert LOCAL_SIZE to integer by @javsalgar in #3971
New Contributors
- @straywarrior made their first contribution in #3664
- @alito made their first contribution in #3720
- @acforvs made their first contribution in #3768
- @keyboardAnt made their first contribution in #3805
- @pinstripe-potoroo made their first contribution in #3834
- @HAOCHENYE made their first contribution in #3879
- @Yejing-Lai made their first contribution in #3778
- @Flamefire made their first contribution in #3854
- @hipudding made their first contribution in #3852
- @PicoCreator made their first contribution in #3936
- @KaiChen1008 made their first contribution in #3930
- @NripeshN made their first contribution in #3907
- @javsalgar made their first contribution in #3971
Full Changelog: v0.9.4...v0.10.0