Releases · microsoft/DeepSpeed
DeepSpeed v0.3.10
v0.3.10 Release notes
Combined release notes since the November 12th v0.3.1 release
- Various updates to torch.distributed initialization
- Transformer kernel updates
- Elastic training support (#602)
  - NOTE: More details to come; this feature is still in an initial piloting phase.
- Module replacement support #586
  - NOTE: This will be documented and used more broadly in the short term to help automatically inject/replace DeepSpeed ops into client models.
- #528 removes the psutil and cpufeature dependencies
- Various ZeRO 1 and 2 bug fixes and updates: #531, #532, #545, #548
- #543 makes checkpoints backwards compatible with older DeepSpeed v0.2 releases
- Add static_loss_scale support to unfused optimizer #546
- Bug fix for norm calculation in absence of model parallel group #551
- Switch CI from Azure Pipelines to GitHub Actions
- Deprecate client ability to disable gradient reduction #552
- Bug fix for tracking optimizer step in cpu-adam when loading checkpoint #564
- Improved support for Ampere architecture #572, #570, #577, #578, #591, #642
- Fix potential random layout inconsistency issues in sparse attention modules #534
- Support customizing kwargs for lr_scheduler #584
- Support passing the DeepSpeed configuration to deepspeed.initialize as a dict instead of via args #632 (see the sketch below)
- Allow DeepSpeed models to be initialized with optimizer=None #469
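Taken together, #632 and #469 change how deepspeed.initialize can be called. Below is a minimal sketch of that usage with a toy torch.nn.Linear model; the config_params keyword and the exact config keys are assumptions based on later DeepSpeed documentation, not confirmed against v0.3.10.

```python
import torch
import deepspeed

# Toy model for illustration only.
model = torch.nn.Linear(10, 10)

# DeepSpeed configuration passed as a Python dict instead of a JSON file path (#632).
ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    # With optimizer=None (#469), the optimizer can instead be declared in the config.
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# config_params is the assumed keyword for the dict config; check the
# deepspeed.initialize signature of your installed version.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config_params=ds_config,
)
```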
Special thanks to our contributors in this release
@stas00, @gcooper-isi, @g-karthik, @sxjscience, @brettkoonce, @carefree0910, @Justin1904, @harrydrippin
DeepSpeed v0.3.1
Updates
- Efficient and robust compressed training through progressive layer dropping
- JIT compilation of C++/CUDA extensions
- Python-only install support, ~10x faster install time
- PyPI hosted installation via pip install deepspeed
- Removed apex dependency
- Bug fixes for ZeRO-offload and CPU-Adam
- Transformer support for dynamic sequence length (#424)
- Linear warmup+decay lr schedule (#414)
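As a companion to the #414 item above, here is a minimal configuration sketch enabling a linear warmup+decay learning-rate schedule. The scheduler name WarmupDecayLR and its parameter names are taken from later DeepSpeed configuration docs and should be treated as assumptions for this release.

```python
# DeepSpeed config fragment for a linear warmup + decay LR schedule (#414).
# Scheduler/parameter names (WarmupDecayLR, warmup_num_steps, ...) are assumed,
# not confirmed against the v0.3.1 documentation.
ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 3e-4}},
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 3e-4,
            "warmup_num_steps": 1000,
            "total_num_steps": 10000,  # LR decays linearly toward 0 after warmup
        },
    },
}
```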
DeepSpeed v0.3.0
New features
Software improvements
- Refactor codebase to make a cleaner distinction between ops/runtime/zero/etc.
- Conditional Op builds
  - Not all users should have to spend time building transformer kernels if they don't want to use them.
  - Some features require unique dependencies that not everyone will be able to or want to install; making these builds conditional keeps DeepSpeed portable across environments (see the sketch after this list).
- The DeepSpeed launcher supports different backends in addition to pdsh, such as Open MPI and MVAPICH.
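To illustrate the conditional/JIT op-build direction, here is a small sketch of probing and building a single op at runtime rather than at install time. The import path and builder class (deepspeed.ops.op_builder.CPUAdamBuilder) come from later DeepSpeed releases and are assumptions here, not a confirmed v0.3.0 API.

```python
# Probe whether one op's build prerequisites are present, and JIT-compile it on
# demand, instead of building every C++/CUDA extension during installation.
# Import path and builder name are assumed from later DeepSpeed releases.
from deepspeed.ops.op_builder import CPUAdamBuilder

builder = CPUAdamBuilder()
if builder.is_compatible():
    cpu_adam = builder.load()  # triggers the JIT build the first time it is called
else:
    print("cpu_adam prerequisites are missing; this op will be skipped")
```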
DeepSpeed v0.2.0
DeepSpeed 0.2.0 Release Notes
Features
- ZeRO-1 with reduce scatter
- ZeRO-2 (see the config sketch after this list)
- Transformer kernels
- Various bug fixes and usability improvements
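A minimal configuration sketch for turning on ZeRO stage 1 or 2 (the ZeRO items above). The key names under zero_optimization follow later DeepSpeed configuration docs and are assumptions for this release.

```python
# Enable ZeRO stage 2 (optimizer state + gradient partitioning) via the config.
# Key names are assumed from later DeepSpeed configuration documentation.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,              # 1 = optimizer state partitioning, 2 = adds gradient partitioning
        "reduce_scatter": True,  # use reduce-scatter instead of all-reduce for gradients
        "overlap_comm": True,    # overlap gradient communication with the backward pass
    },
}
```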
DeepSpeed v0.1.0
DeepSpeed 0.1.0 Release Notes
Features
- Distributed Training with Mixed Precision
  - 16-bit mixed precision
  - Single-GPU/Multi-GPU/Multi-Node
- Model Parallelism
  - Support for Custom Model Parallelism
  - Integration with Megatron-LM
- Memory and Bandwidth Optimizations
  - Zero Redundancy Optimizer (ZeRO) stage 1 with all-reduce
  - Constant Buffer Optimization (CBO)
  - Smart Gradient Accumulation
- Training Features
  - Simplified training API (see the training-loop sketch after this list)
  - Gradient Clipping
  - Automatic loss scaling with mixed precision
- Training Optimizers
  - Fused Adam optimizer and arbitrary torch.optim.Optimizer
  - Memory bandwidth optimized FP16 Optimizer
  - Large Batch Training with LAMB Optimizer
  - Memory efficient Training with ZeRO Optimizer
- Training Agnostic Checkpointing
- Advanced Parameter Search
  - Learning Rate Range Test
  - 1Cycle Learning Rate Schedule
- Simplified Data Loader
- Performance Analysis and Debugging
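As a companion to the simplified training API item above, here is a minimal end-to-end sketch. The toy model, dataset, and ds_config.json are placeholders, and the config keyword on deepspeed.initialize is an assumption based on later releases; older versions read the config path from args instead.

```python
import torch
import deepspeed

# Toy model and dataset for illustration only.
model = torch.nn.Linear(784, 10)
dataset = torch.utils.data.TensorDataset(torch.randn(256, 784),
                                         torch.randint(0, 10, (256,)))

# deepspeed.initialize returns an engine that owns the optimizer, loss scaling,
# gradient accumulation, and a distributed data loader built from training_data.
model_engine, optimizer, train_loader, _ = deepspeed.initialize(
    args=None,
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config="ds_config.json",  # assumed keyword; older versions take the path via args
)

criterion = torch.nn.CrossEntropyLoss()
for x, y in train_loader:
    x, y = x.to(model_engine.device), y.to(model_engine.device)
    loss = criterion(model_engine(x), y)
    model_engine.backward(loss)  # handles loss scaling and gradient accumulation
    model_engine.step()          # optimizer step, LR schedule, and gradient zeroing
```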