Release DeepSpeed v0.10.0 · microsoft/DeepSpeed

New features

ZeRO++: A leap in speed for LLM and chat model training with 4X less communication [English] [中文] [日本語]
H100 support and testing with FP8 using NVIDIA's TransformerEngine

What's Changed

Documentation for DeepSpeed Accelerator Abstraction Interface by @delock in #3184
FP8 unittest for H100 by @jomayeri in #3731
Fix apex install bugs by @loadams in #3741
Fix Autotuner get_gas_from_user_config by @straywarrior in #3664
Include cublas error details when getting cublas handle fails by @jli in #3695
fix hybrid engine mlp module by @tensor-tang in #3736
Fix output transpose dimension bugs by @loadams in #3747
remove UtilsBuilder load, use torch (un)flatten ops by @inkcherry in #3728
add Chinese Zhihu social account by @conglongli in #3755
Account for expert parameters when calculating the total number of pa… by @alito in #3720
fix ccl_backend and residual_add problems by @dc3671 in #3642
Fix url in getting-started guide (docs) by @acforvs in #3768
Update deepspeed-chat/japanese/README.md by @eltociear in #3765
Add H100 workflow and status badge. by @loadams in #3754
Add an api in deepspeed engine for adjusting micro batch size during training by @kisseternity in #3773
Prevent hangs in CI during parallel run compilation by @mrwyattii in #2844
Revert "Prevent hangs in CI during parallel run compilation" by @jeffra in #3817
[Docs] chrome://tracing is deprecated by @keyboardAnt in #3805
Support model declaration in zero.Init context by @tohtana in #3592
Update zeropp.md by @samadejacobs in #3821
Reduce Unit Test Times (Part 1) by @mrwyattii in #3829
Re-enable GPT-J unit tests and refactor inference tests by @mrwyattii in #3618
Fix racing condition in GatheredParameters by @HeyangQin in #3819
zero/mics.py: use on_accelerator instead of cuda only by @guoyejun in #3806
Disable AMD test flows in YML by @loadams in #3847
Reduce Unit Test Time (Part 2) by @mrwyattii in #3838
[profiling]add show_straggler argument to log_summary() by @delock in #3579
checking process_group before merging bucket ranges (#3521) by @clumsy in #3577
scripts/check-torchcuda.py: add checking for tensor.is_cuda by @guoyejun in #3843
Zero3 Fix allreduce optimization for extra large tensor by @hablb in #3832
[zero] revert PR #3166, it disabled grad clip for bf16 by @jeffra in #3790
Fix transpose convolution FLOPS profiler (retrieval of out_channels) by @pinstripe-potoroo in #3834
Fix LoRA Fuse/Unfuse in Hybrid Engine by @sxjscience in #3563
Update pytorch-lightning version in CI by @mrwyattii in #3882
[Docs] MMEngine has integrated deepspeed. by @HAOCHENYE in #3879
Add FALCON Auto-TP Support by @RezaYazdaniAminabadi in #3640
Update apex installation to resolve apex's pyproject.toml issues. by @loadams in #3745
Extend HE-Lora test with Z3 support + Fix/add guard in HE for Z3 by @awan-10 in #3883
Separate ZeRO3 InflightParamRegistry for train and eval by @HeyangQin in #3884
Add GPTNeoX AutoTP support by @Yejing-Lai in #3778
Fix Meta Tensor checkpoint load for BLOOM models by @lekurile in #3885
fix error :Dictionary expression not allowed in type annotation Pylance by @digger-yu in #3708
Fix rnn flop profiler to compute flops instead of macs by @pinstripe-potoroo in #3833
Update workflows for merge queue by @mrwyattii in #3892
Avoid deprecation warnings in CHECK_CUDA by @Flamefire in #3854
Silence comm.py warning by @mrwyattii in #3893
Fix a typo of global variable in comm.py by @hipudding in #3852
[ROCm] Enable TestCUDABackward::test_backward unit tests by @rraminen in #3849
[profiling][mics]Fix some issues for log_summary(). by @ys950902 in #3899
fix "undefined symbol: curandCreateGenerator" for quantizer op by @jinzhen-lin in #3846
fix memory leak with zero-3 by @jeffra in #3903
fix some typo docs/ by @digger-yu in #3917
fix: change ==NONE to is under deepspeed/ by @digger-yu in #3923
Del comment deepspeed.zero.Init() can be used as a decorator by @hipudding in #3894
Remove the param.ds_tensor from print by @HeyangQin in #3928
Reduce Unit Test Times (Part 3) by @mrwyattii in #3850
Update zero_to_fp32.py - to support deepspeed_stage_1 by @PicoCreator in #3936
[docs] add xTrimoPGLM by @jeffra in #3940
Update Nvidia docker base image by @KaiChen1008 in #3930
Fix inference tutorial docs for checkpoints by @loadams in #3955
fix Megatron-DeepSpeed links by @conglongli in #3956
skip bcast when enable pp but pp_group_size=1 by @inkcherry in #3915
Use device_name instead of device index to support other device by @hipudding in #3933
Create accelerator for apple silicon GPU Acceleration by @NripeshN in #3907
fix(cpu_accelerator): 🐛 Convert LOCAL_SIZE to integer by @javsalgar in #3971

New Contributors

@straywarrior made their first contribution in #3664
@alito made their first contribution in #3720
@acforvs made their first contribution in #3768
@keyboardAnt made their first contribution in #3805
@pinstripe-potoroo made their first contribution in #3834
@HAOCHENYE made their first contribution in #3879
@Yejing-Lai made their first contribution in #3778
@Flamefire made their first contribution in #3854
@hipudding made their first contribution in #3852
@PicoCreator made their first contribution in #3936
@KaiChen1008 made their first contribution in #3930
@NripeshN made their first contribution in #3907
@javsalgar made their first contribution in #3971

Full Changelog: v0.9.4...v0.10.0
