Releases · microsoft/mscclpp

23 Dec 19:01

Binyang2014

v0.6.0

2ef070e

MSCCL++ v0.6.0 Latest

Highlight

Improved NCCL API integration in MSCCL++ for better performance and usability
Enhanced execution plan-based executor in MSCCL++
Fixed several bugs to improve stability and reliability

What's Changed

Add support for different vector sizes in multimem instructions by @roshandathathri in #332
NCCL API Executor Integration by @caiomcbr in #331
Fix missing import in executor test by @yzygitzh in #334
bfloat16 support by @chhwang in #336
Dynamically load libibverbs by @caiomcbr in #337
Auto-tune vector sizes for NVLS allreduce6 by @roshandathathri in #338
Make ibverbs optional at compile time by @chhwang in #340
ProxyChannel Support in Executor by @caiomcbr in #342
Support executors to send packets over ProxyChannel by @caiomcbr in #344
Fix for ROCm 6.0 by @chhwang in #347
Fix bug for construct semaphore by @Binyang2014 in #341
Add proxy channel related operations by @Binyang2014 in #351
Add CI for rocm by @Binyang2014 in #346
Tune threads per block for mscclpp executor by @Binyang2014 in #345
Fix NPKit exit event offset by @yzygitzh in #356
Use IB transport flags only when an IB device exists by @chhwang in #355
Update ROCm CI by @chhwang in #357
Fixing RegisterMemory Allocation for ProxyChannels by @caiomcbr in #353
Fix NCCL API bugs by @chhwang in #363
Perf optimization & support clipping by @chhwang in #364
Fix copyright messages by @chhwang in #367
[Doc] mscclpp docs by @Binyang2014 in #348
Executor AllGather In-Place Support by @caiomcbr in #365
Fix algo repo name by @Binyang2014 in #369
Update docker image for cuda12.4 by @Binyang2014 in #370
Fix in-place all-gather input buffer in executor_test by @yzygitzh in #372
[docs] fix quickstart link by @jeffra in #374
Add kernel-based verification for executor_test by @yzygitzh in #378
Lazily create the context stream by @chhwang in #381
Fixing Bug Const Offset in Execution Plan by @caiomcbr in #380
Fix light load bug by @Binyang2014 in #379
Small Adjust in Test Data AllGather at Executor Test by @caiomcbr in #384
Fix missing packet parameter for executor by @yzygitzh in #385
NVLS support for msccl++ executor by @Binyang2014 in #375
Fix typo by @Binyang2014 in #389
Improve CMake options by @chhwang in #376
Fixing Message Boundary AllReduce Fallback Code by @caiomcbr in #391
Fix mscclpp_benchmark by @Binyang2014 in #392
Add cross threadblock barrier by @Binyang2014 in #383
AllGather Executor Support in NCCL Interface by @caiomcbr in #393
Providing reduce-scatter test support by @caiomcbr in #390
Select algo according to json config by @Binyang2014 in #396
Add connection events for NPKit by @yzygitzh in #386
Revised ProxyChannel interfaces by @chhwang in #400
Setup pipeline for mscclpp over nccl by @Binyang2014 in #401
Exception Max Number Operation per Tb by @caiomcbr in #405
Reduce memory usage for scratch buffer by @Binyang2014 in #403
[Cherry-pick] Move pipeline to official org (#406) by @Binyang2014 in #416
[Cherry-pick] trigger ci for release branches (#426) by @Binyang2014 in #427
[Cherry-pick] Disable CuMemMap check for ROCm (#411) by @Binyang2014 in #424
[Cherry-pick] NVLS support for NCCL API (#410) by @Binyang2014 in #425
[Cherry-pick] Fix nccl-test failure issue (#421) by @Binyang2014 in #429

New Contributors

@jeffra made their first contribution in #374

Full Changelog: v0.5.2...v0.6.0

Contributors

jeffra, chhwang, and 4 other contributors

16 Jul 00:37

chhwang

v0.5.2

40cb196

MSCCL++ v0.5.2

What's Changed

Add C++ executor test by @chhwang in #304
Cumulative Updates by @Binyang2014 in #309
Add NPKit GPU event support by @yzygitzh in #310
Fix NPKit support for AMD by @yzygitzh in #312
Add "packet type" option for executor test by @Binyang2014 in #313
Add support for multicast reduce instruction by @roshandathathri in #316
Update quickstart.md by @angelica-moreira in #314
Simplify/improve barrier in AllReduce6 by @roshandathathri in #317
Support NCCL APIs by @caiomcbr in #319
Update allreduce_bench.py by @angelica-moreira in #318
Separate NPKit CPU timestamp access from different blocks for AMD platform by @yzygitzh in #321
AllReduce Kernel for Small Messages by @caiomcbr in #322
Resolve clang++ warnings by @chhwang in #325
Support to write packets via uint2 by @Binyang2014 in #327
Double buffering for NCCL APIs by @caiomcbr in #324
v0.5.2 by @chhwang in #328

New Contributors

@angelica-moreira made their first contribution in #314
@caiomcbr made their first contribution in #319

Full Changelog: v0.5.1...v0.5.2

Contributors

chhwang, Binyang2014, and 4 other contributors

26 May 21:32

chhwang

v0.5.1

cddffbc

MSCCL++ v0.5.1

What's Changed

Upgrade gtest by @chhwang in #300
Rename executor.cpp to executor_py.cpp by @chhwang in #301
Fix assert declaration & add a compile test by @chhwang in #303
Fix security issue by @Binyang2014 in #305
v0.5.1 by @chhwang in #308

Full Changelog: v0.5.0...v0.5.1

Contributors

chhwang and Binyang2014

04 May 23:53

chhwang

v0.5.0

9c2a960

MSCCL++ v0.5.0

What's Changed

Fix a typo name by @chhwang in #286
Add executor to execute schedule-plan file by @Binyang2014 in #283
Allow binding allocated memory to NVLS multicast pointer by @roshandathathri in #290
Separate headers for GPU data types by @chhwang in #291
Refactoring NVLS interfaces by @chhwang in #293
Include GPU data types only for kernel code by @chhwang in #292
Ethernet support by @chhwang in #284
Resolve multi-nodes test failure issue by @Binyang2014 in #295
Move pipeline to Azure org by @Binyang2014 in #296
Optimized the execution kernel by @Binyang2014 in #294
Allow obtaining cuda stream handle from PyTorch stream when launching kernel by @aashaka in #297
v0.5.0 by @chhwang in #298

New Contributors

@roshandathathri made their first contribution in #290

Full Changelog: v0.4.3...v0.5.0

Contributors

chhwang, Binyang2014, and 2 other contributors

27 Mar 18:55

chhwang

v0.4.3

1a7cb98

MSCCL++ v0.4.3

What's Changed

Add optional prefix to installation paths by @chhwang in #235
Fix #235 by @chhwang in #239
Check nvidia_peermem during runtime by @chhwang in #234
Do not check value of __HIP_PLATFORM_AMD__ by @chhwang in #240
Fix crash in static variable destructor by @Binyang2014 in #238
Update interface to let user change fifo size by @Binyang2014 in #243
Mask each field of the trigger by @chhwang in #244
Minor improvement on device syncer by @chhwang in #231
remove make pylib-copy command by @Binyang2014 in #249
Increase MSCCLPP_BITS_REGMEM_HANDLE to 9 by @aashaka in #251
Add putWithSignal() latency tests by @chhwang in #246
NVLS support. by @saeedmaleki in #250
Fix wrong offset calculation by @chhwang in #257
Fix NVLS support by @chhwang in #258
Allow MSCCL++ CommGroup to take PyTorch tensors in args by @aashaka in #255
Fix multi-nodes test failure by @Binyang2014 in #262
Allow semaphores and memory to be registered separately in ProxyService by @aashaka in #264
Remove cuda-python from project by @Binyang2014 in #245
Fix the comm.py for nvls by @saeedmaleki in #267
New packet format & optimizations by @chhwang in #256
Fix multi-node ci pipeline by @Binyang2014 in #272
add launch_bounds for mscclpp_test by @Binyang2014 in #273
Fix bootstrapping mechanism by @chhwang in #278
v0.4.3 by @chhwang in #279

New Contributors

@aashaka made their first contribution in #251

Full Changelog: v0.4.2...v0.4.3

Contributors

chhwang, Binyang2014, and 2 other contributors

20 Dec 12:25

chhwang

v0.4.2

f1605b7

MSCCL++ v0.4.2

What's Changed

Include cstdint in packet_device.hpp by @chhwang in #233
Fix & improve perf for ROCm by @chhwang in #232
v0.4.2 by @chhwang in #236

Full Changelog: v0.4.1...v0.4.2

Contributors

chhwang

06 Dec 02:14

chhwang

v0.4.1

c15a166

MSCCL++ v0.4.1

What's Changed

Fix performance downgrade issue & update doc by @Binyang2014 in #229
Add a documentation issue template by @chhwang in #230

Full Changelog: v0.4.0...v0.4.1

Contributors

chhwang and Binyang2014

24 Nov 09:09

chhwang

v0.4.0

351b95b

MSCCL++ v0.4.0

Add Python benchmark
Update documentation
Add ROCm support
Bug fixes

See details from #160.

11 Oct 14:37

chhwang

v0.3.0

8c0f9e8

MSCCL++ v0.3.0

Updated interfaces
Add Python bindings and interfaces
Add Python unit tests
Add more configurable parameters
Add a new single-node AllReduce kernel
Fix bugs

See details from #89.

Full Changelog: v0.2.0...v0.3.0

MSCCL++ v0.2.0

Communication Features and Interfaces

GPU-side communication interfaces (DeviceChannel)

Host-side interfaces

Transports support

Performance Optimization

Development Pipeline

Documents