Releases · microsoft/DeepSpeed
v0.10.3: Patch release
What's Changed
- Add Mixed Precision ZeRO++ tutorial by @HeyangQin in #4241
- DeepSpeed-Chat Llama2/stability release by @awan-10 in #4240
- Update README.md by @awan-10 in #4244
- Pin Triton version to >=2.0.0 and <2.1.0 by @lekurile in #4251
- Allow modification of zero partitioned parameters by @tjruwase in #4192
- Checks for user injection policy by @satpalsr in #3052
- Add check that opening issues on CI failure requires schedule by @loadams in #4242
- Code Refactoring by @tosemml in #4262
- tolerating missing optimizer states for MoE [2nd attempt] by @clumsy in #4120
- Fix nv-inference/un-pin transformers by @loadams in #4269
- check for zero (empty) param groups in llama + hf/accelerate. by @awan-10 in #4270
- use `non_reentrant_checkpoint` fix requires_grad of input must be true for activation checkpoint layer in pipeline train. by @inkcherry in #4224
- The PostBackwardFunction class should be more clearly named to distinguish it from the PreBackwardFunction class. by @Crispig in #2548
- fix iteration timing used in autotuning when gradient_accumulation_steps > 1 by @cli99 in #2888
- Update README.md by @NinoRisteski in #4284
- update deepspeed to run with the most recent triton 2.1.0 by @stephen-youn in #4278
- Keep hpz secondary tensor in forward pass by @HeyangQin in #4288
- Support iterators with incompletely defined len functions by @codedecde in #2445
- AMD Kernel Compatibility Fixes by @cmikeh2 in #3180
- ZeRO-Inference refresh by @tjruwase in #4197
- fix user args parsing of string with spaces on runner by @YudiZh in #4265
- Update index.md by @NinoRisteski in #4297
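The Mixed Precision ZeRO++ tutorial and the hpz fixes above all revolve around the ZeRO++ configuration knobs. A minimal sketch of such a config, with illustrative values (consult the ZeRO++ tutorial for the exact keys and recommended settings on your hardware):

```json
{
  "zero_optimization": {
    "stage": 3,
    "zero_hpz_partition_size": 16,
    "zero_quantized_weights": true,
    "zero_quantized_gradients": true
  }
}
```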
New Contributors
- @tosemml made their first contribution in #4262
- @Crispig made their first contribution in #2548
- @NinoRisteski made their first contribution in #4284
- @codedecde made their first contribution in #2445
- @YudiZh made their first contribution in #4265
Full Changelog: v0.10.2...v0.10.3
v0.10.2: Patch release
What's Changed
- MP ZeRO++ by @HeyangQin in #3954
- do allgather only in shared optimizer states groups by @inkcherry in #4167
- Permit empty environment variables as unset in `setup.py` by @loadams in #4185
- enable autoTP for mpt in huggingface model hub without trust_remote_c… by @sywangyi in #4062
- Fix nv-nightly workflow by @mrwyattii in #4163
- Fix the path in tutorial by @kytimmylai in #4193
- Add unit test to check HF low_cpu_mem_usage_flag by @loadams in #4184
- Fix ZeRO parameter initialization for tensors with `requires_grad=True` by @XuehaiPan in #4138
- DeepSpeed Ulysses tutorial by @minjiaz in #4200
- Load z3 checkpoints for inference by @tjruwase in #4171
- DeepSpeed Ulysses release by @samadejacobs in #4198
- Deepspeed-Ulysses blog by @samadejacobs in #4201
- Ds ulysses news by @samadejacobs in #4202
- DS-Ulysses formating by @samadejacobs in #4204
- Update Ulyssess by @samadejacobs in #4205
- Update README.md by @samadejacobs in #4211
- Add Japanese blog of DS-Ulysses by @tohtana in #4209
- DeepSpeed Ulysses Chinese blog translation by @HeyangQin in #4210
- add ulysses blog index by @conglongli in #4215
- Add MuP optimizers by @mrwyattii in #2043
- Simplify Gradient Attribute Names by @jomayeri in #4214
- add meta onDevice support for LLAMA2 by @dc3671 in #4147
- Fixes timer error referenced in #4212 by @bjoernpl in #4213
- Fix pipeline dataloader when batch elements contain tuple by @ghosthamlet in #565
- feat(activation_checkpointing): add `non_reentrant_checkpoint` to support inputs require no grad by @hughpu in #4118
- add npu support dtypes by @CurryRice233 in #4223
- Fix fused qkv sizing for bloom by @molly-smith in #4161
- added port argument for ssh by @Hiromasa-H in #4117
- Empty tensor size check by @jomayeri in #4186
- fix: linker issues in conda environments #3929 by @maximegmd in #4235
- Enable AMD MI200 and H100 to run on branches for testing by @loadams in #4238
- fix MegatronLayerPolicy to be compatible with the newest ParallelTransformerLayer by @dc3671 in #4236
- Enable hpz when running with torch.no_grad by @HeyangQin in #4232
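The `setup.py` change above (#4185) treats an empty environment variable the same as an unset one. A standalone sketch of that convention (the helper name `env_or_default` and the variable name are hypothetical, not DeepSpeed API):

```python
import os

def env_or_default(name: str, default: str) -> str:
    """Treat an empty environment variable the same as an unset one."""
    value = os.environ.get(name, "")
    return value if value != "" else default

os.environ["DS_BUILD_EXAMPLE"] = ""             # set, but empty
print(env_or_default("DS_BUILD_EXAMPLE", "1"))  # empty counts as unset → 1
```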
New Contributors
- @kytimmylai made their first contribution in #4193
- @bjoernpl made their first contribution in #4213
- @Hiromasa-H made their first contribution in #4117
- @maximegmd made their first contribution in #4235
Full Changelog: v0.10.1...v0.10.2
v0.10.1: Patch release
What's Changed
- [docs] add zero++ paper link by @jeffra in #3974
- Avoid race condition with port selection in unit tests by @mrwyattii in #3975
- Remove duplicated inference unit tests by @mrwyattii in #3951
- Switch to torch.linalg.norm by @loadams in #3984
- Simplify chain comparisons, remove redundant parentheses by @digger-yu in #3912
- [CPU] Support HBM flatmode and fakenuma mode by @delock in #3918
- Fix checkpoint conversion when model layers share weights by @awaelchli in #3825
- fixing flops profiler formatting, units and precision by @clumsy in #3927
- Specify language=python in pre-commit hook by @wangruohui in #3994
- [CPU] Skip CPU support unimplemented error by @Yejing-Lai in #3633
- ZeRO Gradient Accumulation Dtype. by @jomayeri in #2847
- [CPU] Use allreduce_low_latency for AutoTP and implement low latency allreduce for CPU backend (single node) by @delock in #3919
- Re-enable skipped unit tests by @mrwyattii in #3939
- Make AMD/ROCm apex install to /blob to save test/compile time. by @loadams in #3997
- Option to exclude frozen weights for checkpoint save by @tjruwase in #3953
- Allow user to select name of .deepspeed_env by @loadams in #4006
- Silence backend warning by @mrwyattii in #4009
- Fix user arg parsing in single node deployment by @mrwyattii in #4007
- Specify triton 2.0.0 requirement by @mrwyattii in #4008
- Re-enable elastic training for torch 2+ by @loadams in #4010
- add /dev/shm size to ds_report by @jeffra in #4015
- Make Ascend NPU available by @hipudding in #3831
- RNNprofiler: fix gates size retrieval logic in _rnn_flops by @pinstripe-potoroo in #3921
- fix typo in SECURITY.md by @jstan327 in #4019
- add llama2 autoTP support in replace_module by @dc3671 in #4022
- [zero_to_fp32] 3x less cpu memory requirements by @stas00 in #4025
- [CPU] FusedAdam and CPU training support by @delock in #3991
- remove duplicate check for pp and zero stage by @inkcherry in #4033
- Pass missing positional arguments in `DeepSpeedHybridEngine.generate()` by @XuehaiPan in #4026
- Remove print of weight parameter in RMS norm by @puneeshkhanna in #4031
- Monitored Loss Calculations by @jomayeri in #4030
- fix(pipe): make pipe module `load_state_dir` non-strict-mode work by @hughpu in #4020
- polishing timers and log_dist by @clumsy in #3996
- Engine side fix for loading llama checkpoint fine-tuned with zero3 by @minjiaz in #3981
- fix: Remove duplicate word the by @digger-yu in #4051
- [Bug Fix] Fix comm logging for inference by @delock in #4043
- fix opt-350m shard loading issue in AutoTP by @sywangyi in #3600
- enable autoTP for MPT by @sywangyi in #3861
- autoTP for fused qkv weight by @inkcherry in #3844
- [CPU] Faster reduce kernel for SHM allreduce by @delock in #4049
- Multiple zero stage 3 related fixes by @tjruwase in #3886
- Fix deadlock when SHM based allreduce spin too fast by @delock in #4048
- [MiCS] [Bugfix] set self.save_non_zero_checkpoint=True only for first partition group by @zarzen in #3787
- add reproducible compilation environment by @fecet in #3943
- fix: remove unnecessary `#` punct in the second `sed` command by @hughpu in #4061
- Refactor autoTP inference for HE by @molly-smith in #4040
- Fix transformers unit tests by @mrwyattii in #4079
- Fix Stable Diffusion Injection by @lekurile in #4078
- Spread layers more uniformly when using partition_uniform by @marcobellagente93 in #4053
- fix typo: change polciies to policies by @digger-yu in #4090
- update ut/doc for glm/codegen by @inkcherry in #4057
- zero_to_fp32 script adds support for tag argument by @EeyoreLee in #4089
- add type checker ignore by @EeyoreLee in #4102
- Fix generate config validation error on inference unit tests by @mrwyattii in #4107
- use correct ckpt path when base_dir not available by @polisettyvarma in #4101
- Disable z3 tracing profiler by @tjruwase in #4106
- Pass correct node size for ZeRO++ by @cmikeh2 in #4085
- add deepspeed chat arxiv report by @conglongli in #4110
- enable pipeline checkpoint loading mode by @leiwen83 in #3629
- Fix Issue 4083 by @jomayeri in #4084
- Add full list of DS_BUILD_* by @loadams in #4119
- Update nightly workflows to open an issue if CI fails by @loadams in #3952
- Update torch1.9 tests to 1.10 to match latest accelerate. by @loadams in #4126
- Handle PermissionError in os.chmod Call - Update engine.py by @M-Chris in #4139
- Generalize frozen weights unit test by @tjruwase in #4140
- Respect memory pinning config by @tjruwase in #4131
- Remove incorrect async-io library checking code. by @loadams in #4150
- Return nn.parameter type for weights and biases by @molly-smith in #4146
- Fixes #4151 by @saforem2 in #4152
- Handling for SIGTERM as well by @loadams in #4160
- Fix CI Badges by @mrwyattii in #4162
- Add DS-Chat CI workflow by @lekurile in #4127
- [CPU][Bugfix] Make uid and addr_port part of SHM name in CCL backend by @delock in #4115
- Add DSE branch input to nv-ds-chat by @lekurile in #4173
- Pin transformers by @mrwyattii in #4174
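The "Spread layers more uniformly when using partition_uniform" fix (#4053) concerns splitting N pipeline layers across P stages so stage sizes differ by at most one. A self-contained sketch of that idea (this is not DeepSpeed's actual `partition_uniform` implementation):

```python
def uniform_partition(num_items: int, num_parts: int) -> list:
    """Split num_items contiguous layers into num_parts parts whose
    sizes differ by at most one; returns boundary indices, so part p
    owns layers [bounds[p], bounds[p+1])."""
    base, extra = divmod(num_items, num_parts)
    bounds = [0]
    for p in range(num_parts):
        # the first `extra` parts absorb one leftover layer each
        bounds.append(bounds[-1] + base + (1 if p < extra else 0))
    return bounds

print(uniform_partition(10, 4))  # → [0, 3, 6, 8, 10]
```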
New Contributors
- @awaelchli made their first contribution in #3825
- @wangruohui made their first contribution in #3994
- @jstan327 made their first contribution in #4019
- @XuehaiPan made their first contribution in #4026
- @puneeshkhanna made their first contribution in #4031
- @hughpu made their first contribution in #4020
- @fecet made their first contribution in #3943
- @marcobellagente93 made their first contribution in #4053
- @polisettyvarma made their first contribution in #4101
- @leiwen83 made their first contribution in #3629
- @M-Chris made their first contribution in #4139
Full Changelog: v0.10.0...v0.10.1
DeepSpeed v0.10.0
New features
- ZeRO++: A leap in speed for LLM and chat model training with 4X less communication [English] [中文] [日本語]
- H100 support and testing with FP8 using NVIDIA's TransformerEngine
What's Changed
- Documentation for DeepSpeed Accelerator Abstraction Interface by @delock in #3184
- FP8 unittest for H100 by @jomayeri in #3731
- Fix apex install bugs by @loadams in #3741
- Fix Autotuner get_gas_from_user_config by @straywarrior in #3664
- Include cublas error details when getting cublas handle fails by @jli in #3695
- fix hybrid engine mlp module by @tensor-tang in #3736
- Fix output transpose dimension bugs by @loadams in #3747
- remove UtilsBuilder load, use torch (un)flatten ops by @inkcherry in #3728
- add Chinese Zhihu social account by @conglongli in #3755
- Account for expert parameters when calculating the total number of pa… by @alito in #3720
- fix ccl_backend and residual_add problems by @dc3671 in #3642
- Fix url in getting-started guide (docs) by @acforvs in #3768
- Update deepspeed-chat/japanese/README.md by @eltociear in #3765
- Add H100 workflow and status badge. by @loadams in #3754
- Add an api in deepspeed engine for adjusting micro batch size during training by @kisseternity in #3773
- Prevent hangs in CI during parallel run compilation by @mrwyattii in #2844
- Revert "Prevent hangs in CI during parallel run compilation" by @jeffra in #3817
- [Docs] `chrome://tracing` is deprecated by @keyboardAnt in #3805
- Support model declaration in zero.Init context by @tohtana in #3592
- Update zeropp.md by @samadejacobs in #3821
- Reduce Unit Test Times (Part 1) by @mrwyattii in #3829
- Re-enable GPT-J unit tests and refactor inference tests by @mrwyattii in #3618
- Fix racing condition in GatheredParameters by @HeyangQin in #3819
- zero/mics.py: use on_accelerator instead of cuda only by @guoyejun in #3806
- Disable AMD test flows in YML by @loadams in #3847
- Reduce Unit Test Time (Part 2) by @mrwyattii in #3838
- [profiling]add show_straggler argument to log_summary() by @delock in #3579
- checking process_group before merging bucket ranges (#3521) by @clumsy in #3577
- scripts/check-torchcuda.py: add checking for tensor.is_cuda by @guoyejun in #3843
- Zero3 Fix allreduce optimization for extra large tensor by @hablb in #3832
- [zero] revert PR #3166, it disabled grad clip for bf16 by @jeffra in #3790
- Fix transpose convolution FLOPS profiler (retrieval of out_channels) by @pinstripe-potoroo in #3834
- Fix LoRA Fuse/Unfuse in Hybrid Engine by @sxjscience in #3563
- Update pytorch-lightning version in CI by @mrwyattii in #3882
- [Docs] MMEngine has integrated deepspeed. by @HAOCHENYE in #3879
- Add FALCON Auto-TP Support by @RezaYazdaniAminabadi in #3640
- Update apex installation to resolve apex's pyproject.toml issues. by @loadams in #3745
- Extend HE-Lora test with Z3 support + Fix/add guard in HE for Z3 by @awan-10 in #3883
- Separate ZeRO3 InflightParamRegistry for train and eval by @HeyangQin in #3884
- Add GPTNeoX AutoTP support by @Yejing-Lai in #3778
- Fix Meta Tensor checkpoint load for BLOOM models by @lekurile in #3885
- fix error :Dictionary expression not allowed in type annotation Pylance by @digger-yu in #3708
- Fix rnn flop profiler to compute flops instead of macs by @pinstripe-potoroo in #3833
- Update workflows for merge queue by @mrwyattii in #3892
- Avoid deprecation warnings in `CHECK_CUDA` by @Flamefire in #3854
- Silence comm.py warning by @mrwyattii in #3893
- Fix a typo of global variable in comm.py by @hipudding in #3852
- [ROCm] Enable TestCUDABackward::test_backward unit tests by @rraminen in #3849
- [profiling][mics]Fix some issues for log_summary(). by @ys950902 in #3899
- fix "undefined symbol: curandCreateGenerator" for quantizer op by @jinzhen-lin in #3846
- fix memory leak with zero-3 by @jeffra in #3903
- fix some typo docs/ by @digger-yu in #3917
- fix: change ==NONE to is under deepspeed/ by @digger-yu in #3923
- Del comment deepspeed.zero.Init() can be used as a decorator by @hipudding in #3894
- Remove the param.ds_tensor from print by @HeyangQin in #3928
- Reduce Unit Test Times (Part 3) by @mrwyattii in #3850
- Update zero_to_fp32.py - to support deepspeed_stage_1 by @PicoCreator in #3936
- [docs] add xTrimoPGLM by @jeffra in #3940
- Update Nvidia docker base image by @KaiChen1008 in #3930
- Fix inference tutorial docs for checkpoints by @loadams in #3955
- fix Megatron-DeepSpeed links by @conglongli in #3956
- skip bcast when enable pp but pp_group_size=1 by @inkcherry in #3915
- Use device_name instead of device index to support other device by @hipudding in #3933
- Create accelerator for apple silicon GPU Acceleration by @NripeshN in #3907
- fix(cpu_accelerator): 🐛 Convert LOCAL_SIZE to integer by @javsalgar in #3971
New Contributors
- @straywarrior made their first contribution in #3664
- @alito made their first contribution in #3720
- @acforvs made their first contribution in #3768
- @keyboardAnt made their first contribution in #3805
- @pinstripe-potoroo made their first contribution in #3834
- @HAOCHENYE made their first contribution in #3879
- @Yejing-Lai made their first contribution in #3778
- @Flamefire made their first contribution in #3854
- @hipudding made their first contribution in #3852
- @PicoCreator made their first contribution in #3936
- @KaiChen1008 made their first contribution in #3930
- @NripeshN made their first contribution in #3907
- @javsalgar made their first contribution in #3971
Full Changelog: v0.9.4...v0.10.0
v0.9.5: Patch release
What's Changed
- Documentation for DeepSpeed Accelerator Abstraction Interface by @delock in #3184
- FP8 unittest for H100 by @jomayeri in #3731
- Fix apex install bugs by @loadams in #3741
- Fix Autotuner get_gas_from_user_config by @straywarrior in #3664
- Include cublas error details when getting cublas handle fails by @jli in #3695
- fix hybrid engine mlp module by @tensor-tang in #3736
- Fix output transpose dimension bugs by @loadams in #3747
- remove UtilsBuilder load, use torch (un)flatten ops by @inkcherry in #3728
- add Chinese Zhihu social account by @conglongli in #3755
- Account for expert parameters when calculating the total number of pa… by @alito in #3720
- fix ccl_backend and residual_add problems by @dc3671 in #3642
- Fix url in getting-started guide (docs) by @acforvs in #3768
- Update deepspeed-chat/japanese/README.md by @eltociear in #3765
- Add H100 workflow and status badge. by @loadams in #3754
- Zero++ tutorial PR by @HeyangQin in #3783
- [Fix] _conv_flops_compute when padding is a str and stride=1 by @zhiruiluo in #3169
- fix interpolate flops compute by @cli99 in #3782
- use `Flops Profiler` to test `model.generate()` by @CaffreyR in #2515
- [zero] revert PR #3611 by @jeffra in #3786
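The flops-profiler fixes above (`_conv_flops_compute` with string padding and stride=1 in #3169, interpolate flops in #3782) all hinge on getting the convolution output size right. A sketch using the standard PyTorch output-length convention (an assumption here, not code from the profiler):

```python
def conv_out_len(n: int, kernel: int, stride: int = 1,
                 padding: int = 0, dilation: int = 1) -> int:
    """Output length of a 1-D convolution (standard PyTorch convention).
    'same'-style padding (kernel // 2 for odd kernels) with stride=1
    preserves the input length."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

print(conv_out_len(32, 3, stride=1, padding=1))  # → 32
```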
New Contributors
- @straywarrior made their first contribution in #3664
- @alito made their first contribution in #3720
- @acforvs made their first contribution in #3768
- @zhiruiluo made their first contribution in #3169
- @CaffreyR made their first contribution in #2515
Full Changelog: v0.9.4...v0.9.5
v0.9.4: Patch release
What's Changed
- [MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding by @zarzen in #3440
- fix some typo by @digger-yu in #3675
- Use logger in accelerator by @tjruwase in #3682
- Update README to add ICS'23 paper on Tensor Parallel MoEs by @siddharth9820 in #3687
- non-JIT build fix on ROCm by @rraminen in #3638
- Fix local rank mismatch error when training on nodes with different number of GPUs by @byungsoo-oh in #3409
- Correct world_size/backend for mpi by @abhilash1910 in #3694
- Fix incorrectly formatted f string in hostfile checking by @loadams in #3698
- fix typo name of hybrid engine func by @tensor-tang in #3689
- Revert "fix typo name (#3689)" by @loadams in #3702
- Fix gpt-j inference issue by @RezaYazdaniAminabadi in #3639
- change partititon_name to partition_name by @digger-yu in #3700
- Fix unit test typo in tests/unit/ops/transformer/inference by @mrwyattii in #3697
- Small tweak on cuda version mismatch documentation by @jli in #3706
- DeepSpeed overview in Japanese by @conglongli in #3709
- zero3 performance optimizations by @hablb in #3622
- Fix typo in name of hybrid engine function by @loadams in #3704
- Increase tensor creator coverage by @tjruwase in #3684
- [Bugfix][CPU] Remove C++ version in CPU OpBuilder by @delock in #3643
- Single Node is using unreferenced pdsh kill cmd while terminating by @abhilash1910 in #3730
- Update Dockerfile with newer cuda and torch. by @loadams in #3716
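Several fixes above (the local-rank mismatch on nodes with differing GPU counts in #3409, the hostfile f-string check in #3698) concern multi-node launches driven by a hostfile. For reference, the DeepSpeed hostfile format lists one host per line with its GPU slot count (hostnames here are illustrative):

```
worker-1 slots=8
worker-2 slots=4
```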
New Contributors
- @byungsoo-oh made their first contribution in #3409
- @abhilash1910 made their first contribution in #3694
- @tensor-tang made their first contribution in #3689
- @jli made their first contribution in #3706
Full Changelog: v0.9.3...v0.9.4
v0.9.3: Patch release
What's Changed
- Enable auto TP policy for llama model by @jianan-gu in #3170
- Allow users to use mis-matched CUDA versions by @mrwyattii in #3436
- Hybrid Engine Refactor and Llama Inference Support by @cmikeh2 in #3425
- add sharded checkpoint loading for AutoTP path to reduce the peak mem… by @sywangyi in #3102
- launcher/multinode_runner.py: mapping env variables by @YizhouZ in #3372
- Update automatic-tensor-parallelism.md by @sywangyi in #3198
- Build: Update license in setup by @PabloEmidio in #3484
- Doc corrections by @goodship1 in #3435
- Fix spelling errors in comments and documents by @digger-yu in #3486
- Fix spelling error in function GetMaxTokenLength() by @luliyucoordinate in #3482
- Fix a type error on bf16+Pipeline Parallelism by @ys950902 in #3441
- Fix spelling errors in DeepSpeed codebase by @digger-yu in #3494
- fix spelling error with docs/index.md by @digger-yu in #3443
- delete the line to keep user_zero_stages by @MrZhengXin in #3473
- Update Inference Engine checkpoint loading + meta tensor assertions by @lekurile in #2940
- fix regression in shard checkpoint loading in AutoTP Path caused by qkv_copy() is deleted and add UT case for shard checkpoint loading in AutoTP by @sywangyi in #3457
- Add snip_momentum structured pruning which supports higher sparse ratio by @ftian1 in #3300
- Update README.md by @goodship1 in #3504
- Hybrid Engine Fix Llama by @lekurile in #3505
- fix spelling error with deepspeed/runtime/ by @digger-yu in #3509
- Skip autoTP if tp_size is 1 by @molly-smith in #3449
- Changing monitor loss to aggregate loss over gradient accumulation steps by @jomayeri in #3428
- change actions/checkout@v2 to v3 by @digger-yu in #3526
- fix typo with docs/ by @digger-yu in #3523
- Doc updates by @goodship1 in #3520
- Fix bug in Hybrid Engine by @mrwyattii in #3497
- Fix wrong passing of offload_optimizer_config to DeepSpeedZeRoOffload by @mmhab in #3420
- Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2 by @YizhouZ in #2999
- share inflight registry between PartitionedParameterCoordinators by @HeyangQin in #3462
- Syncing FusedAdam with new Apex features by @jomayeri in #3434
- fix typo in comments with deepspeed/ by @digger-yu in #3537
- [ROCm] Hip headers fix by @rraminen in #3532
- [CPU] Support Intel CPU inference by @delock in #3041
- Clone tensors to avoid torch.save bloat by @tjruwase in #3348
- Fix attribute error when loading FusedAdamBuilder() by @rraminen in #3527
- fix typo by @inkcherry in #3559
- Fixing bf16 test by @jomayeri in #3551
- Fix Hybrid Engine for BLOOM by @lekurile in #3580
- Fix op_builder against PyTorch nightly by @malfet in #3596
- data efficiency bug fix, avoid invalid range step size by @conglongli in #3609
- DS init should not broadcast or move zero.Init models by @tjruwase in #3611
- Expose Consecutive Hysteresis to Users by @Quentin-Anthony in #3553
- Align InferenceEngine to store ms in _model_times by @HolyFalafel in #3501
- AISC launcher fixes by @jeffra in #3637
- stage3.py: do not scale if gradient_predivide_factor is 1.0 by @guoyejun in #3630
- Add Ascend NPU accelerator support by @CurryRice233 in #3595
- Skip tests on docs-only changes by @mrwyattii in #3651
- Update megatron.md by @wjessup in #3641
- Typo Correction by @MicahZoltu in #3621
- deepspeed/comm/comm.py: fix typo of warning message by @guoyejun in #3636
- Fix RuntimeError when using ZeRO Stage3 with mpu: #3564 by @eggiter in #3565
- Allow dict datatype for checkpoints (inference) by @mrwyattii in #3007
- fix typo with deepspeed/ by @digger-yu in #3547
- flops_profiler: add option recompute_fwd_factor for the case of activation c… by @guoyejun in #3362
- fix typo deepspeed/runtime by @digger-yu in #3663
- Refactor check_enabled root validator in DeepSpeedMonitorConfig by @bgr8 in #3616
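The `stage3.py` change above (#3630) skips gradient pre-division when the factor is 1.0, avoiding a pointless pass over the gradients. A trivial standalone sketch of that fast path (the function name is hypothetical):

```python
def predivide(grad: float, factor: float) -> float:
    """Sketch of the #3630 idea: skip the division entirely when the
    gradient_predivide_factor is 1.0 (it would be a no-op anyway)."""
    if factor == 1.0:
        return grad          # fast path: no scaling needed
    return grad / factor
```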
New Contributors
- @jianan-gu made their first contribution in #3170
- @YizhouZ made their first contribution in #3372
- @PabloEmidio made their first contribution in #3484
- @luliyucoordinate made their first contribution in #3482
- @ys950902 made their first contribution in #3441
- @MrZhengXin made their first contribution in #3473
- @ftian1 made their first contribution in #3300
- @mmhab made their first contribution in #3420
- @malfet made their first contribution in #3596
- @HolyFalafel made their first contribution in #3501
- @CurryRice233 made their first contribution in #3595
- @wjessup made their first contribution in #3641
- @MicahZoltu made their first contribution in #3621
- @eggiter made their first contribution in #3565
- @bgr8 made their first contribution in #3616
Full Changelog: v0.9.2...v0.9.3
v0.9.2: Patch release
What's Changed
- MiCS implementation by @zarzen in #2964
- Fix formatting by @mrwyattii in #3343
- [ROCm] Hipify cooperative_groups headers by @rraminen in #3323
- Diffusers 0.15.0 bug fix by @molly-smith in #3345
- Print default values for DeepSpeed --help by @mrwyattii in #3347
- add bf16 cuda kernel support by @dc3671 in #3092
- README.md: Update MosaicML docs link by @kobindra in #3344
- hybrid_engine: check tuple size when fusing lora params by @adammoody in #3311
- fix mpich launcher issue in multi-node by @sywangyi in #3078
- Update DS-Chat issue template by @mrwyattii in #3368
- add deepspeed chat blog links, add tags by @conglongli in #3369
- Fix redundant shared_params in zero_to_fp32.py by @ShijieZZZZ in #3149
- fixing default communication_data_type for bfloat16_enabled and docs by @clumsy in #3370
- Auto TP Tutorial with T5 Example by @molly-smith in #2962
- stage_1_and_2.py: do gradient scale only for fp16 by @guoyejun in #3166
- Fix memory leak in zero2 contiguous gradients by @hablb in #3306
- remove megatron-lm, no longer pip installable by @jeffra in #3389
- Fix pipeline module evaluation when contiguous activation checkpoin… by @hablb in #3005
- doc updates by @goodship1 in #3415
- Save tensors in context of memory_efficient_linear by @tohtana in #3413
- Add HE support for the rest of model containers by @RezaYazdaniAminabadi in #3191
- Update PyTorch Lightning/DeepSpeed examples links by @loadams in #3424
- Fix `PipelineEngine.eval_batch` result by @nrailgun in #3316
- OPT Activation Function Hotfix by @cmikeh2 in #3400
- Add ZeRO 1 support to PP for BF16. by @jomayeri in #3399
- [zero_to_fp32] fix shared param recovery by @stas00 in #3407
- Adagrad support in ZeRO by @jomayeri in #3401
- Update 2020-09-09-sparse-attention.md by @goodship1 in #3432
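The communication-data-type fix above (#3370) concerns which dtype collectives use when bf16 training is enabled. A minimal config sketch making the choice explicit rather than relying on the default (values illustrative):

```json
{
  "bf16": { "enabled": true },
  "communication_data_type": "bf16"
}
```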
New Contributors
- @dc3671 made their first contribution in #3092
- @kobindra made their first contribution in #3344
- @hablb made their first contribution in #3306
- @nrailgun made their first contribution in #3316
Full Changelog: v0.9.1...v0.9.2
v0.9.1: Patch release
What's Changed
- Update DS-Chat docs for v0.9.0 by @mrwyattii in #3216
- Update DeepSpeed-Chat docs with latest changes to scripts by @mrwyattii in #3219
- Nested zero.Init() and dynamically defined model class by @tohtana in #2989
- Update torch version check in building sparse_attn by @loadams in #3152
- Fix for Stable Diffusion by @mrwyattii in #3218
- [update] reference in cifar-10 by @dtunai in #3212
- [fp16/doc] correct initial_scale_power default value by @stas00 in #3275
- update link to PL docs by @Borda in #3237
- fix typo in autotuner.py by @eltociear in #3269
- improving int4 asymmetric quantization accuracy by @HeyangQin in #3190
- Update install.sh by @digger-yu in #3270
- Fix cupy install version detection by @mrwyattii in #3276
- [ROCm] temporary workaround till __double2half support enabled in HIP by @bmedishe in #3236
- Fix pydantic and autodoc_pydantic version to <2.0.0 until support is added. by @loadams in #3290
- Add contribution images to readme by @digger-yu in #3282
- remove `torch.cuda.is_available()` check when compiling ops by @jinzhen-lin in #3085
- Update MI200 workflow to install apex with changes from pip by @loadams in #3294
- Add pre-compiling ops test by @loadams in #3277
- Update README.md by @digger-yu in #3315
- Update Dockerfile to use python 3.6 specifically by @bobowwb in #3298
- zero3 checkpoint frozen params by @tjruwase in #3205
- Fix for dist not being initialized when constructing main config by @mrwyattii in #3324
- Fix missing scale attributes for GPTJ by @cmikeh2 in #3256
- Explicitly check for OPT activation function by @cmikeh2 in #3278
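The `initial_scale_power` documentation fix above (#3275) corrects the stated default of the fp16 loss-scaling config. A minimal fp16 section spelling out those knobs (values shown are the commonly documented defaults; verify against the current config docs):

```json
{
  "fp16": {
    "enabled": true,
    "initial_scale_power": 16,
    "hysteresis": 2
  }
}
```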
New Contributors
- @dtunai made their first contribution in #3212
- @Borda made their first contribution in #3237
- @digger-yu made their first contribution in #3270
- @bmedishe made their first contribution in #3236
- @jinzhen-lin made their first contribution in #3085
- @bobowwb made their first contribution in #3298
Full Changelog: v0.9.0...v0.9.1
DeepSpeed v0.9.0
What's Changed
- [docs] add MCR-DL paper to readme/docs by @Quentin-Anthony in #3066
- Several fixes to unblock CI by @loadams in #3047
- Assert mp_size is factor of model dimensions by @molly-smith in #2891
- [CI] follow-up fixes by @jeffra in #3072
- fix return prev key and value, added strides to from_blob by @mzusman in #2828
- Remove bf16 from inference config dtype enum by @molly-smith in #3010
- Softmax Scheduling Cleanup by @cmikeh2 in #3046
- Fix nebula in save_16bit_model issue by @FreyaRao in #3023
- Allow lists by @satpalsr in #3042
- Goodbye Torch 1.8 by @mrwyattii in #3082
- Empty ZeRO3 partition cache by @tjruwase in #3060
- pre-commit check for torch.cuda in code by @delock in #2981
- Move cuda check into utils by @loadams in #3074
- update yapf version and style settings by @jeffra in #3098
- Fix comms benchmark import issues and support MPI/slurm launching by @Quentin-Anthony in #2932
- Disable Stage 1&2 CPUAdam pathways by @mrwyattii in #3097
- ♻️ replace deprecated functions for communication by @mayank31398 in #2995
- Make fp32 default communication data type by @tjruwase in #2970
- Update DeepSpeed copyright license to Apache 2.0 by @mrwyattii in #3111
- Add Full Apache License by @mrwyattii in #3119
- VL MoE Blog by @yaozhewei in #3120
- Update SD triton version in requirements-sd.txt by @lekurile in #3135
- Fix launch issue by @tjruwase in #3137
- Fix CI badges by @mrwyattii in #3138
- Optimize Softmax Kernel by @molly-smith in #3112
- Use generic O_DIRECT by @tjruwase in #3115
- Enable autoTP for bloom by @sywangyi in #3035
- [cleanup] remove `pass` calls where they aren't needed by @stas00 in #2826
- [ci] `nv-transformers-v100` - use the same torch version as transformers CI by @stas00 in #3096
- Fixes code and tests skipping/asserting incorrectly on torch 2+. by @loadams in #3136
- fix example symlink about DeepSpeed+AzureML by @EeyoreLee in #3127
- Remove Extra Bracket by @VHellendoorn in #3101
- Recover shared parameters by @ShijieZZZZ in #3033
- Fix for Diffusers 0.14.0 by @molly-smith in #3142
- Fix copyright check, add copyright replace script by @mrwyattii in #3141
- Update curriculum-learning.md by @goodship1 in #3031
- Remove benchmark code by @mrwyattii in #3157
- fixing a bug in CPU Adam and Adagrad by @xiexbing in #3109
- op_builder: conditionally compute relative path for hip compiled files by @adammoody in #3095
- zero.Init() should pin params in GPU memory as requested by @tjruwase in #2953
- deepspeed/runtime/utils.py: reset_peak_memory_stats when empty cache by @guoyejun in #2803
- Add DeepSpeed-Chat Blogpost by @awan-10 in #3185
- [docs] add run command for 13b by @awan-10 in #3187
- add news item. by @awan-10 in #3188
- DeepSpeed Chat by @tjruwase in #3186
- Fix references to figures by @tohtana in #3189
- Fix typo by @zhouzaida in #3183
- Fix typo by @dawei-wang in #3164
- Chatgpt chinese blog by @yaozhewei in #3193
- Add Japanese version of ChatGPT-like pipeline blog by @tohtana in #3194
- fix hero figure by @conglongli in #3199
- feat: Add support for `NamedTuple` when sharding parameters [#3029] by @alexandervaneck in #3037
- fix license badge by @conglongli in #3200
- Update AMD workflows by @loadams in #3179
- [CPU support] Optionally bind each rank to different cores on host by @delock in #2881
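The CPU core-binding feature above (#2881) pins each local rank to a distinct contiguous slice of host cores. A sketch of the core-set computation only (the helper name is hypothetical; the real launcher option also accounts for NUMA domains and hyperthread siblings):

```python
def cores_for_rank(rank: int, cores_per_rank: int) -> set:
    """Contiguous core set a local rank would be pinned to."""
    start = rank * cores_per_rank
    return set(range(start, start + cores_per_rank))

# On Linux, the current process could then be pinned with e.g.:
#   os.sched_setaffinity(0, cores_for_rank(rank, cores_per_rank))
print(cores_for_rank(1, 4))  # → {4, 5, 6, 7}
```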
New Contributors
- @mzusman made their first contribution in #2828
- @FreyaRao made their first contribution in #3023
- @sywangyi made their first contribution in #3035
- @EeyoreLee made their first contribution in #3127
- @VHellendoorn made their first contribution in #3101
- @goodship1 made their first contribution in #3031
- @zhouzaida made their first contribution in #3183
- @dawei-wang made their first contribution in #3164
- @alexandervaneck made their first contribution in #3037
Full Changelog: v0.8.3...v0.9.0