NVIDIA Megatron Core 0.13.0 (core_v0.13.0)
  • Features
  • Inference
    • Add async support for DynamicInferenceEngine (MR !3187)
    • Pad input tensors and enable FP8 weights for FP8 inference (MR !3341)
    • Force inference to always gather logits with tensor parallelism (MR !3442)
    • Multi batch size CUDA Graphs for Dynamic Inference (MR !3402)
  • Post-training
    • ModelOpt updates (MR !3268)
      • Add speculative decoding AR validation feature
      • Add DeepSeek and Qwen model configs
  • Performance
    • ModelCommProcessGroup integration (MR !3391)
    • Add HyperCommGrid: N-Dimensional Communication Grid for Model Parallelism (MR !3398)
      • Flexible creation and management of communication groups
    • Add support for Spike No More embedding initializations and weight decay skipping (MR !3500)
  • Model support
    • Add MiMo video VLM train example (MR !3543)
    • Add AVLM for MIMO (MR !3624)
  • Ease of use
    • Add uv support for source installs (MR !3615)
    • Automated weekly prereleases (MR !3574)
  • Bug fixes
    • Use mscale_all_dim for softmax_factor (MR !2800)
    • Fix FP8 param blockwise scaling unit test (MR !3480)
    • Fix unit test blockwise scaling (MR !3491)
    • Optimize prefill for token-less requests (MR !3499)
    • Add default values for Fp8Padding and Fp8Unpadding (MR !3501)
    • Fix CUDA graph logic for flexible pp layout (MR !3505)
    • Load FP8 models with strict=False (MR !3508)
    • Skip rope check for torch < 1.4.0 (MR !3528)
    • Disable Apex tests for stability (MR !3539)
    • Fix typo in parallel_state expert parallelism (MR !3548)
    • Guard modelopt on macOS (MR !3549)
    • Retry on CUDA function failure (MR !3554)
    • Fix NCCL mem pool creation error (MR !3557)
    • Fix get_rotary_seq_len return type (MR !3559)
    • Retry on CUDA function failure (MR !3560)
    • Fix NCCL allocator attribute error (MR !3565)
    • Ensure multi-prompt inference works (MR !3568)
    • Fix MD5 on FIPS systems (MR !3577)
    • Fixes dynamic context and inference bugs (MR !3582)
    • Fix TE version for interleaved fused RoPE (MR !3586)
    • Fix MTP with MoE and TP logging (MR !3594)
    • Guard TE import fix (MR !3596)
    • Add assertion for NCCL UB case (MR !3599)
    • Remove Encoder PP related Functions (MR !3604)
    • Fix segfaults in tests (MR !3605)
    • Fix TE error in distributed optimizer (MR !3625)
    • Remove redundant barrier in checkpoint flow (MR !3626)
    • Support VPP MTP, fix logging (MR !3630)
    • Retry mechanism for free(): invalid pointer errors (MR !3632)
    • Fix test_replication.py issues (MR !3633)
    • Fix typo in parallel_state (MR !3634)
    • Fix CUDA graph logic determination (MR !3635)
    • Fix TE installation error (MR !3636)
    • Ensure correct sharding type in local tests (MR !3643)
    • Fix cudagraphed backward buffer reuse for last layer (MR !3645)
    • Set default for packed_seq_params in get_rotary_seq_len (MR !3651)
    • Fix dynamic example script errors (MR !3653)
    • Guard TE import fix (MR !3666)
  • Known issues
Source: README.md, updated 2025-07-25
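
The HyperCommGrid entry under Performance describes an N-dimensional communication grid used to create and manage communication groups for model parallelism. The notes do not show its API, so the following is only a conceptual sketch in plain Python of how rank groups can be derived from such a grid. The function name `grid_groups`, its parameters, the dimension names, and the row-major rank ordering are all assumptions made for illustration and are not taken from Megatron Core.

```python
from itertools import product


def grid_groups(shape, dim_names, group_dims):
    """Enumerate rank groups on an N-dimensional grid.

    Hypothetical helper for illustration only; not the HyperCommGrid API.
    Ranks are numbered in row-major order over `shape` (an assumption).
    Each returned group holds the ranks that vary along `group_dims`
    while every other coordinate is held fixed.
    """
    if len(shape) != len(dim_names):
        raise ValueError("shape and dim_names must have the same length")
    unknown = set(group_dims) - set(dim_names)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")

    # Row-major strides: the last dimension changes fastest.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]

    vary = [i for i, name in enumerate(dim_names) if name in group_dims]
    fixed = [i for i in range(len(shape)) if i not in vary]

    groups = []
    for fixed_coord in product(*(range(shape[i]) for i in fixed)):
        # Rank offset contributed by the coordinates that stay fixed.
        base = sum(c * strides[i] for c, i in zip(fixed_coord, fixed))
        group = [
            base + sum(c * strides[i] for c, i in zip(vary_coord, vary))
            for vary_coord in product(*(range(shape[i]) for i in vary))
        ]
        groups.append(group)
    return groups


# 8 ranks arranged as a 2 x 2 x 2 grid; the dimension names are illustrative.
shape = (2, 2, 2)
names = ("pp", "dp", "tp")
tp_groups = grid_groups(shape, names, ["tp"])
dp_groups = grid_groups(shape, names, ["dp"])
dp_tp_groups = grid_groups(shape, names, ["dp", "tp"])
```

In this sketch a group over a single dimension contains the ranks that differ only in that coordinate, and a group over several dimensions is the product of those coordinates. Each resulting rank list could then be handed to a collective-communication backend to create a process group. How HyperCommGrid actually orders ranks and builds its groups should be checked against the Megatron Core source.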