| Name | Modified | Size | Downloads / Week |
|---|---|---|---|
| NVIDIA Megatron Core 0.16.0 source code.tar.gz | 2026-02-26 | 8.4 MB | |
| NVIDIA Megatron Core 0.16.0 source code.zip | 2026-02-26 | 10.0 MB | |
| README.md | 2026-02-26 | 42.4 kB | |
| Totals: 3 items | | 18.4 MB | 2 |
Changelog Details

- ci: Fix copyright checker by @ko3n1g :: PR: [#1893]
- chore: Add codeowners by @ko3n1g :: PR: [#1897]
- ci: Extend queue-manager for dev branch by @ko3n1g :: PR: [#1906]
- ci: Move test optimizer into its own bucket by @ko3n1g :: PR: [#1909]
- ci: Configure cherrypick bot by @ko3n1g :: PR: [#1925]
- Ci approve dev by @ko3n1g :: PR: [#1933]
- ci: Update nightly schedule by @ko3n1g :: PR: [#1934]
- ci: Bump pre-flight for runs on main/dev by @ko3n1g :: PR: [#1935]
- ci: Allow skipping on main by @ko3n1g :: PR: [#1936]
- Ko3n1g/ci/pr template community bot by @ko3n1g :: PR: [#1937]
- ci: More granular unit tests buckets by @ko3n1g :: PR: [#1932]
- Add sequence packing to RL by @tdene :: PR: [#1911]
- chore: Update template by @ko3n1g :: PR: [#1939]
- chore: Add description about who can merge by @ko3n1g :: PR: [#1940]
- Ko3n1g/ci/fix main on eos by @ko3n1g :: PR: [#1938]
- Ko3n1g/ci/internal mrs by @ko3n1g :: PR: [#1942]
- ci: Fix branch of approval bot by @ko3n1g :: PR: [#1944]
- ci: Approvalbot for other branches by @ko3n1g :: PR: [#1947]
- ci(fix): Approval bot by @ko3n1g :: PR: [#1949]
- Ko3n1g/ci/sync branches by @ko3n1g :: PR: [#1956]
- Ko3n1g/ci/add milestone by @ko3n1g :: PR: [#1951]
- Remove M-FSDP testing under LTS environment by @shjwudp :: PR: [#1959]
- ci: Run on push to release branch by @ko3n1g :: PR: [#1960]
- Fix typo in rl section of CODEOWNERS by @tdene :: PR: [#1968]
- ci: Update copyright checker by @ko3n1g :: PR: [#1973]
- Ko3n1g/ci/auto reminder GitHub by @ko3n1g :: PR: [#1955]
- ci(fix): `Run tests` label by @ko3n1g :: PR: [#1970]
- Make `get_asyncio_loop` safe to use repeatedly by @tdene :: PR: [#1990]
- chore: Update codeowners by @ko3n1g :: PR: [#2012]
- zarr soft deprecation by @dimapihtar :: PR: [#2004]
- Deduplicate dynamic engine + coordinator. by @lmcafee-nvidia :: PR: [#1981]
- Update symmetric registration interface to sync-up with upstream pytorch change by @youngeunkwon0405 :: PR: [#1924]
- Safely access state dict args in load ckpt by @maanug-nv :: PR: [#1957]
- Allow mixed-batch sampling in dynamic inference by @tdene :: PR: [#1927]
- Stop Nemo_CICD_Test from failing in forks by @tdene :: PR: [#2024]
- Clean up dynamic inference step by @tdene :: PR: [#1992]
- ci: Auto-update copy-pr-bot vetters by @ko3n1g :: PR: [#1850]
- ci: Fix build-push-wheel workflow by @ko3n1g :: PR: [#2022]
- ci: Enable integration tests by @ko3n1g :: PR: [#2023]
- chore: Update tooling for interactive jobs by @ko3n1g :: PR: [#2032]
- Have datasets account for tokenizers which incorrectly define PAD by @tdene :: PR: [#2017]
- revert(hotfix): ci: trustees_override by @ko3n1g :: PR: [#2041]
- add missing warnings import in model parallel config by @yashaswikarnati :: PR: [#2039]
- Reduce-scatter implementation with FP32 accumulation by @deepakn94 :: PR: [#1967]
- ci(fix): Workflows on `main` by @ko3n1g :: PR: [#2045]
- build: Bump modelopt by @ko3n1g :: PR: [#2046]
- Remove TestCaptureFreezeGC unit test. by @lmcafee-nvidia :: PR: [#1978]
- ci: Add multi-approval action by @ko3n1g :: PR: [#2051]
- Ko3n1g/ci/test iteration time by @ko3n1g :: PR: [#2067]
- Allow inference test throughput to vary by 10% by @mathemakitten :: PR: [#2070]
- chore: Fix autoformatter by @ko3n1g :: PR: [#2073]
- ci(hotfix): Bypass approvalbot in merge-queue by @ko3n1g :: PR: [#2082]
- chore: Update local tooling by @ko3n1g :: PR: [#2066]
- Add extra RL files by @tdene :: PR: [#2077]
- Prevent summary jobs from running in forks by @tdene :: PR: [#2083]
- ci: Fix test scope by @ko3n1g :: PR: [#2091]
- Refactor the attention metadata into separate classes by @kanz-nv :: PR: [#2001]
- Guard against incorrectly using MoE prefill graphs by @tdene :: PR: [#2030]
- Run mr-slim tests in lightweight-mode by @chtruong814 :: PR: [#2106]
- Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: [#1977]
- chore: Reenable trustees by @ko3n1g :: PR: [#2108]
- Ko3n1g/chore/update release settings by @ko3n1g :: PR: [#2097]
- ci(fix): Changeset of copyright checker by @ko3n1g :: PR: [#2110]
- Remove unnecessary check on rotary_pos_cos by @santhnm2 :: PR: [#2003]
- (Reverted) Inference | Lazy compile UVM allocator. by @lmcafee-nvidia :: PR: [#2125]
- Refactor Attention Metadata to Separate Classes by @kanz-nv :: PR: [#2112]
- Refactor model_provider to model_builder format for ModelOpt examples by @AAnoosheh :: PR: [#2107]
- wandb Inference stats logging by @wdykas :: PR: [#2026]
- Make `PipelineParallelLayout` always return str from `__repr__` by @ananthsub :: PR: [#2055]
- Add flash_attn_3 as first option for FA3 import by @santhnm2 :: PR: [#2010]
- Add debugging hint for case when cudagraphs are created but no matching runner is found by @mathemakitten :: PR: [#2129]
- ci: LTS container by @ko3n1g :: PR: [#2133]
- Fix param init by @cuichenx :: PR: [#2033]
- Hotfix to unit tests on hopper FA3 by @tdene :: PR: [#2143]
- Add BytesIO to safe_globals by @tdene :: PR: [#2074]
- add deprecation warning for legacy tokenizer system by @dimapihtar :: PR: [#2145]
- replay: ci: Bump LTS container by @ko3n1g :: PR: [#2157]
- Hotfix to unit tests on hopper FA3 (bis) by @tdene :: PR: [#2179]
- Fix has_modelopt_state() for native Torch checkpoint format by @AAnoosheh :: PR: [#2160]
- chore: Remove codeowners by @ko3n1g :: PR: [#2175]
- Fix FP8 inference with sequence parallelism by @santhnm2 :: PR: [#2009]
- Replace ModelOpt generation server by @AAnoosheh :: PR: [#2147]
- Add hybrid model support for dynamic inference engine by @santhnm2 :: PR: [#1907]
- Async task and event loop safety in Megatron Core by @tdene :: PR: [#2025]
- Rename skip_prompt_log_probs by @tdene :: PR: [#2181]
- Dynamic inference context | UVM only. by @lmcafee-nvidia :: PR: [#1983]
- ci: Run `auto-update-copy-pr-bot` only on forks by @ko3n1g :: PR: [#2191]
- Inference throughput tests: refactor goldens to be in list format by @mathemakitten :: PR: [#2072]
- Enable TE custom quantization recipe by @negvet :: PR: [#2005]
- Add MoE parameters to ModelOpt pruning example + conf fixes by @kevalmorabia97 :: PR: [#2205]
- Add repr to pg collection class by @yashaswikarnati :: PR: [#2089]
- Move `data_samplers.py` from `legacy` to `training.datasets` & add `DistributedSignalHandler` to DataLoader workers by @asolergi-nv :: PR: [#2068]
- Fix Megatron-FSDP checkpoint save failure by @shjwudp :: PR: [#2138]
- Fix moe CODEOWNERS. by @jaredcasper :: PR: [#2200]
- chore: Update LICENSE by @ko3n1g :: PR: [#2219]
- remove `megatron.training` dependency from `megatron.core` for FSDP checkpoint with EP by @ananthsub :: PR: [#2113]
- Tensorize dynamic inference mixed sampling by @tdene :: PR: [#2105]
- Add unit test for inference DP coordinator by @tdene :: PR: [#2187]
- Inference linear layer by @sidsingh-nvidia :: PR: [#1908]
- chore: Prefer Nvidia email addresses for reminder bot by @ko3n1g :: PR: [#2221]
- [Megatron-FSDP] Fix hang caused by non-deterministic reduce-scatter by @shjwudp :: PR: [#2218]
- Remove qwen symlink to fix for case-insensitive FS by @kevalmorabia97 :: PR: [#2235]
- Optimizer refactor: clean up public `get_megatron_optimizer` interface and provide a more general API to support passing in different hyperparameters to subsets of parameters by @deepakn94 :: PR: [#2047]
- Fix CI for PR#1983 by @lmcafee-nvidia :: PR: [#2245]
- Fix aux-loss logging for hybrid models by @deepakn94 :: PR: [#2197]
- Update flops calculation (for throughput) for hybrid MoEs by @deepakn94 :: PR: [#2198]
- Enable kv cache in training for eagle by @yeyu-nvidia :: PR: [#1895]
- Tensorize dynamic inference mixed sampling (bis) by @tdene :: PR: [#2231]
- chore: Fix codeowners by @ko3n1g :: PR: [#2264]
- Allow loading checkpoint from iteration 0 by @ananthsub :: PR: [#2199]
- ci: Skip install test in merge queue by @chtruong814 :: PR: [#2281]
- Add MoE layer type to hybrid models by @deepakn94 :: PR: [#2259]
- Add the Hybrid-EP backend to the Flex Dispatcher by @Autumn1998 :: PR: [#2176]
- [MAIN][NVFP4] Support NVFP4 MOE with Proper Padding by @zhongbozhu :: PR: [#1985]
- Update ModelOpt example readmes and advanced usage by @kevalmorabia97 :: PR: [#2273]
- Fix UVM compatibility with CUDA 13. by @lmcafee-nvidia :: PR: [#2243]
- ci: Add flaky marker to LTS tests by @ko3n1g :: PR: [#2290]
- Dynamic engine suspend/resume via prefill. by @lmcafee-nvidia :: PR: [#1982]
- fix: Pass the timeout argument for the EP group by @yanring :: PR: [#2268]
- JIT for MoE router and preprocess by @yaox12 :: PR: [#1919]
- Hotfix to CI, until the fix gets reviewed by @tdene :: PR: [#2298]
- Add functional test for DP coordinator throughput by @tdene :: PR: [#2189]
- Add asyncio Queue like in Python 3.13 by @tdene :: PR: [#2224]
- Fixes for PR#1982 by @lmcafee-nvidia :: PR: [#2303]
- Fix PP KV cache allocation and enable multi-node PP inference by @santhnm2 :: PR: [#2182]
- Revert active-buffer-size-gb arg name. by @lmcafee-nvidia :: PR: [#2257]
- feat: check: api backwards compatibility by @pablo-garay :: PR: [#2251]
- Add MambaInferenceStateConfig dataclass by @santhnm2 :: PR: [#2265]
- Fix typo in inference example by @santhnm2 :: PR: [#2311]
- feat: initialization of API backward compatibility verification by @pablo-garay :: PR: [#2310]
- Fix Mamba TP and remove confusing legacy initialization by @jaredcasper :: PR: [#2202]
- Refactor KD to use ModelOpt plugins file by @AAnoosheh :: PR: [#2305]
- Fix dynamic context syntax and remove redundant tensors by @kanz-nv :: PR: [#2336]
- Improve asyncio exception handling by @tdene :: PR: [#2300]
- ci: Upload to testpypi only on main by @ko3n1g :: PR: [#2342]
- implement graph config by @kanz-nv :: PR: [#2203]
- feat: required check adjustment by @pablo-garay :: PR: [#2350]
- fix: load iteration 0 for release checkpoints by @ananthsub :: PR: [#2351]
- Explicitly zero out padding token activations for dynamic inference by @santhnm2 :: PR: [#2008]
- Bugfix for Mamba with Chunked-Prefill by @sidsingh-nvidia :: PR: [#2293]
- Break apart dynamic inference step into 2 methods by @tdene :: PR: [#2192]
- Prevent unnecessarily overwriting the default Hugging Face chat template by @santhnm2 :: PR: [#2183]
- Refactor KD to use ModelOpt plugins file (v2) by @AAnoosheh :: PR: [#2355]
- add FIM dataset support by @dimapihtar :: PR: [#2291]
- Revert "Explicitly zero out padding token activations for dynamic inference (#2008)" by @chtruong814 :: PR: [#2360]
- Clean up DP coord code & unit test by @tdene :: PR: [#2277]
- [4/4] Merge Megatron-RL into LM by @tdene :: PR: [#2002]
- Update coordinator control logic to be compatible with RL by @tdene :: PR: [#2227]
- ci: Update backwards compat check baseline to 53bbf7a by @chtruong814 :: PR: [#2361]
- Account for test regression caused by prints by @tdene :: PR: [#2354]
- Remove dependency on `megatron.training` within `megatron.core` by @ananthsub :: PR: [#2274]
- Fixes for gpt-oss by @cuichenx :: PR: [#2038]
- [HOT FIX] Fix bug of hybrid-ep backend in flex-dispatcher by @Autumn1998 :: PR: [#2286]
- ci: Remove nemo-ci environment by @chtruong814 :: PR: [#2364]
- ci: Pass COMMUNITY_PROJECT_ID to community bot by @chtruong814 :: PR: [#2366]
- ci: Remove environment from community-bot by @chtruong814 :: PR: [#2376]
- ci: Bump commit for api check to d61029f by @chtruong814 :: PR: [#2386]
- Revert: trigger_mbridge_tests.yml file change by @pablo-garay :: PR: [#2389]
- build: Upgrade deps by @ko3n1g :: PR: [#2289]
- Change KV cache init to empty to speedup graph recording and first prefill by @kanz-nv :: PR: [#2358]
- Reduce Overhead in Timers by @yaox12 :: PR: [#2210]
- Remove experimental tags for fused kernels. by @Victarry :: PR: [#2233]
- Handle UVM compile lock issues by @tdene :: PR: [#2299]
- Fix the entropy sign. by @yobibyte :: PR: [#2374]
- Remove RL use of mock dataloader and kill RL inference interface on exit by @jon-barker :: PR: [#2387]
- Fix block_bag for RL by @kanz-nv :: PR: [#2399]
- adding action for checking whether PR author is nvidia employee or not for selecting ephemeral ci hosts by @theothermike :: PR: [#2402]
- Added top n log probs by @shanmugamr1992 :: PR: [#2262]
- fix: exit failure when PR author is external contributor removed by @theothermike :: PR: [#2410]
- Fix logging when no IS is enabled. by @yobibyte :: PR: [#2375]
- Various small fixes for Megatron-FSDP. by @cspades :: PR: [#2346]
- Add grpo loop functional test by @jon-barker :: PR: [#2403]
- YARN position embedding clear forward method lru cache in init function by @guyueh1 :: PR: [#2229]
- Graph Config Implementation by @kanz-nv :: PR: [#2380]
- fix: adding k8s taints for ephermeral jobs by @theothermike :: PR: [#2420]
- ci: Enable functional tests by @ko3n1g :: PR: [#2419]
- Reapply "build: Upgrade deps ([#2289])" by @ko3n1g :: PR: [#2408]
- fix: use a script to do node tainting in the cicd workflow by @theothermike :: PR: [#2421]
- Fix rl training with data reuse. by @yobibyte :: PR: [#2428]
- Reapply - Add grpo loop functional test by @jon-barker :: PR: [#2411]
- chore: Add copyright to run_simple_mcore_train_loop.py by @chtruong814 :: PR: [#2441]
- Retry inference test on different device if throughput slower than expected by @mathemakitten :: PR: [#2443]
- feat: mcore trigger mbridge by @pablo-garay :: PR: [#2340]
- Remove redundant reduce in aux_loss logging by @BestJuly :: PR: [#2095]
- chore: Update codeowners for post-training by @ko3n1g :: PR: [#2462]
- [Fix] Pass metadata to sharded_state_dict in load_modelopt_checkpoint by @kevalmorabia97 :: PR: [#2451]
- Add support for fake distributed process groups. by @Victarry :: PR: [#2280]
- fix: Add merge_group support with pre-flight pattern by @pablo-garay :: PR: [#2463]
- Add missing checkpoint arguments for MoE models by @santhnm2 :: PR: [#2465]
- Add assertion for mxfp8 params without dp overlap by @kunlunl :: PR: [#2271]
- Clean log probs by @shanmugamr1992 :: PR: [#2404]
- ci: Bump copyright workflow by @ko3n1g :: PR: [#2473]
- Fix `ImportError` and `NameError` in `examples/run_simple_mcore_train_loop.py` by @marksverdhei :: PR: [#1980]
- fix: Revert "Clean log probs (#2404)" by @chtruong814 :: PR: [#2475]
- Make grpo CI test use read-only data by @jon-barker :: PR: [#2472]
- Fix default.yaml for HFDatasetAgent use in countdown by @jon-barker :: PR: [#2487]
- Update golden values to allow new PRs to be merged by @tdene :: PR: [#2478]
- Clean log probs copy by @shanmugamr1992 :: PR: [#2477]
- Attention mask as PackedSeqParams by @jalbericiola :: PR: [#2461]
- fp8 param cuda graph support main by @kunlunl :: PR: [#2088]
- docs: Add changelog for 0.15 by @ko3n1g :: PR: [#2499]
- feat: improve external contributor single use ephemeral nodes by @theothermike :: PR: [#2503]
- Fix sequence parallel. by @yobibyte :: PR: [#2444]
- update API check baseline by @pablo-garay :: PR: [#2505]
- Associate default rl cuda graphs attributes with args by @yobibyte :: PR: [#2453]
- No using tokenizer in request record. by @lmcafee-nvidia :: PR: [#2382]
- make default --inference-dynamic-batching-cuda-graph-max-tokens value match old version by @jon-barker :: PR: [#2540]
- Adjust the default CG size for functional test by @tdene :: PR: [#2544]
- feat: API compat: ignore AttributeChangedValueBreakage (not a signature change) by @pablo-garay :: PR: [#2543]
- feat: add decorator: experimental_api by @pablo-garay :: PR: [#2539]
- ci: Add release workflows by @ko3n1g :: PR: [#2507]
- Fixing PG routing for inference & training separation by @wdykas :: PR: [#2485]
- ci: Fix release workflow by @ko3n1g :: PR: [#2553]
- fix: Duplicate artifact names by @ko3n1g :: PR: [#2556]
- ci: Avoid naming collision by @ko3n1g :: PR: [#2558]
- ci: Fixing naming collision by @ko3n1g :: PR: [#2559]
- fix: publish release wheel and github release version number by @ko3n1g :: PR: [#2561]
- Fix MoE capacity handling by @DaizeDong :: PR: [#2214]
- Avoid calling set_save_original_input with FP8 delayed scaling by @dalgarak :: PR: [#1860]
- build: Bump TE to 2.10 by @ko3n1g :: PR: [#2496]
- add more tokenizer arguments by @dimapihtar :: PR: [#2377]
- Add per-module TE quant config. by @kwyss-nvidia :: PR: [#2359]
- Make check_large_grads non-fatal by @kwyss-nvidia :: PR: [#2307]
- fix for sequence packing plus sequence parallel: padding the sequence to a multiple of TP by @jalbericiola :: PR: [#2574]
- Torch symmetric - new latency optimized NVLS communication kernels for sequence parallelism by @sidsingh-nvidia :: PR: [#1997]
- Various quality-of-life improvements in training loop by @deepakn94 :: PR: [#2580]
- [Main] Support MTP packed-seq in main branch by @BestJuly :: PR: [#2173]
- Support TP greater than num_kv_heads by supporting QKV activation sub-sharding by @deepakn94 :: PR: [#2565]
- Fix FA3 import by @santhnm2 :: PR: [#2577]
- Fix runaway Etpt in straggler detector by resetting FLOPs accumulator by @cms42 :: PR: [#1755]
- Rename TensorRT Model Optimizer to Model Optimizer by @AAnoosheh :: PR: [#2373]
- Fix aux loss scale when CP is enabled. by @Victarry :: PR: [#2237]
- Save memory using main_param for moe in param_l2_norm by @BestJuly :: PR: [#2249]
- Changes to support latent MoEs by @deepakn94 :: PR: [#2296]
- update API compat check baseline to b51db3e by @pablo-garay :: PR: [#2588]
- Fix invalid argument failing tests on main by @tdene :: PR: [#2589]
- Add openmathinstruct config. by @yobibyte :: PR: [#2586]
- Move model configs to github. by @yobibyte :: PR: [#2587]
- fix: Assign tokenizer to Encoder.tokenizer in legacy mode by @iuyo5678 :: PR: [#2498]
- Delete redundant import in yaml_arguments.py by @wplf :: PR: [#2139]
- Fix world size mismatch causing distributed init deadlock (Issue [#2458]) by @CodersAcademy006 :: PR: [#2571]
- Improve performance of request_metadata logic by @tdene :: PR: [#2378]
- Fix broken Table of Contents links in README.md by @JungHoyoun :: PR: [#1954]
- Add minor log update by @gautham-kollu :: PR: [#2080]
- Fix link to NeMo performance summary documentation by @janbernloehr :: PR: [#2190]
- Prep for refit by @wdykas :: PR: [#2590]
- feat: API compat: ignore ParameterMovedBreakage for __init__ methods by @pablo-garay :: PR: [#2595]
- Fix NameError in pretrain_retro.py (add import_module), remove unused… by @vignesh1507 :: PR: [#2084]
- QK logits clipping (non-split version) by @BoxiangW :: PR: [#1929]
- update checkpointing documentation by @dimapihtar :: PR: [#2606]
- [training migration] add training config dataclass and arg generation utility by @maanug-nv :: PR: [#2306]
- Check skip_prompt_log_probs in add_request by @tdene :: PR: [#2593]
- Refit prep 2 by @wdykas :: PR: [#2608]
- Batch Invariance by @wdykas :: PR: [#2308]
- Remove flattened_range code paths for distributed optimizer checkpointing by @dimapihtar :: PR: [#2126]
- update commit by @dimapihtar :: PR: [#2631]
- Create separate teacher Layer Spec in KD mode by @AAnoosheh :: PR: [#2429]
- [docs] Migrate docs to new Sphinx by @Phlip79 :: PR: [#2489]
- Nemotron nano v2 vl changes for Megatron Bridge by @cuichenx :: PR: [#2078]
- Dynamic context | Re-add max_requests arg. by @lmcafee-nvidia :: PR: [#2488]
- Inference | Fix entangled request generations. by @lmcafee-nvidia :: PR: [#2584]
- fix gpt3_mcore_reruns_resume_check_grads by @dimapihtar :: PR: [#2646]
- Add option to only log inference every N steps by @tdene :: PR: [#2637]
- [docs] Use autodoc2 and remove automodule by @Phlip79 :: PR: [#2542]
- add backward compatibility support for loading mcore 0.15 checkpoints by @dimapihtar :: PR: [#2648]
- add offline eagle3 instructions to readme by @yeyu-nvidia :: PR: [#2246]
- Only initialize symmetric memory when needed by @sidsingh-nvidia :: PR: [#2665]
- Update docstrings for dataset by @Phlip79 :: PR: [#2666]
- Simplify parameter sync for checkpoint save by @ananthsub :: PR: [#2344]
- [Megatron-FSDP] Support both old and new DeviceMesh APIs. by @cspades :: PR: [#2575]
- Enable hybrid tensor + expert + data parallelism in mcore inference by @sidsingh-nvidia :: PR: [#2470]
- Fix failing functional tests by @sidsingh-nvidia :: PR: [#2679]
- M4 + Dist Checkpoint: Replace global parallel state with explicit group parameters by @dimapihtar :: PR: [#2053]
- fix deprecated decorator import by @dimapihtar :: PR: [#2680]
- Inference | Add request only if no paused requests. by @lmcafee-nvidia :: PR: [#2600]
- Added integration for Kitchen extensions' SDPA and FA implementations by @frsun-nvda :: PR: [#2232]
- Pipeline parallelism fix in RL and sequence packing rewriting by @jalbericiola :: PR: [#2632]
- Add oncall rotation by @Phlip79 :: PR: [#2622]
- Upgrade GitHub Actions to latest versions by @salmanmkc :: PR: [#2678]
- docs: Adding documentation.md to cover building documentation. by @aschilling-nv :: PR: [#2683]
- [Megatron-FSDP] Build default FSDP DeviceMesh, and remove model arg from fully_shard_optimizer(). by @cspades :: PR: [#2471]
- Add moe layer perf UT. by @Victarry :: PR: [#2673]
- [docs] Add ability to disable autodoc2 for local builds by @Phlip79 :: PR: [#2669]
- Fix oncall assignment by @Phlip79 :: PR: [#2686]
- docs(readme): update Latest News section by @sbhavani :: PR: [#2684]
- Update RNG sharding to include EP rank by @paul-gibbons :: PR: [#2658]
- Add CODEOWNER for API backwards compatibility check files by @pablo-garay :: PR: [#2687]
- Mark API backwards compatibility checks as OPTIONAL (non-blocking) by @pablo-garay :: PR: [#2697]
- pip install uv during GH action by @Phlip79 :: PR: [#2695]
- Don't delete svcnvidia-nemo-ci team from oncall by @Phlip79 :: PR: [#2703]
- RL: Rollouts should be distributed over the regular data parallel group by @sidsingh-nvidia :: PR: [#2634]
- Use pull_request_target and don't use uv by @Phlip79 :: PR: [#2702]
- Optimize TE cudagraph input memory by @buptzyb :: PR: [#2392]
- ci(fix): Pin gojq to stable version by @ko3n1g :: PR: [#2480]
- NVLS - fused reduce-scatter + residual + rms-norm + all-gather kernel by @sidsingh-nvidia :: PR: [#2599]
- Default UVM level to 0. by @lmcafee-nvidia :: PR: [#2450]
- docs: improve documentation organization and add additional guides by @sbhavani :: PR: [#2671]
- Revert "Default UVM level to 0. (#2450)" by @chtruong814 :: PR: [#2713]
- Add missing imports in no-triton fallback by @maanug-nv :: PR: [#2711]
- Fixes for [#2450]. by @lmcafee-nvidia :: PR: [#2714]
- Add RL parameter to set parallel generation tasks by @tdene :: PR: [#2712]
- Refit prep 3 by @wdykas :: PR: [#2708]
- chore: Add cudagraph codeowners by @ko3n1g :: PR: [#2720]
- [docs] Add developer section to docs by @Phlip79 :: PR: [#2717]
- Fix UVM argument for RL by @tdene :: PR: [#2722]
- [docs] Update docs title to Megatron Core by @Phlip79 :: PR: [#2729]
- remove fp16 assert in moe_grouped_gemm & EP by @HaochenYuan :: PR: [#2495]
- Improve ModelOpt paths & add more Nemotron/hybrid model support by @jenchen13 :: PR: [#2131]
- Add options to improve data loader initialization time, especially at scale by @asolergi-nv :: PR: [#2445]
- ci: Fix copy-pr-bot update by @ko3n1g :: PR: [#2736]
- Add oncall to all new PRs by @Phlip79 :: PR: [#2734]
- Hsdp register submesh fix lifuz mirror by @tomlifu :: PR: [#2467]
- Update sequence packing case when dummy PackedSeqParams are used by @mathemakitten :: PR: [#2743]
- Add support for non-decode CUDA graphs for Mamba models by @santhnm2 :: PR: [#2474]
- Fix oncall assign by @Phlip79 :: PR: [#2737]
- Adding stop word support by @shanmugamr1992 :: PR: [#2685]
- feat: manual registration mode for nccl-ub option when using megatron-fsdp by @youngeunkwon0405 :: PR: [#2661]
- Update oncall for next few weeks by @Phlip79 :: PR: [#2748]
- Prep work for migrating to types from ModuleSpec by @nschank :: PR: [#2668]
- feat(MoE): Refactor cuda_graph_scope by @buptzyb :: PR: [#1920]
- Fix merge conflict in [#1920] by @tdene :: PR: [#2781]
- ci: Allow disabling external contributors by @chtruong814 :: PR: [#2784]
- Reflect the changes made by [#1920] in RL by @tdene :: PR: [#2780]
- Fix 2780 by @tdene :: PR: [#2791]
- Update PR message by @Phlip79 :: PR: [#2778]
- Ignore bot for oncall by @Phlip79 :: PR: [#2756]
- Only assign oncall to main PRs by @Phlip79 :: PR: [#2755]
- Explicitly zero out padding token outputs when using quantization scales by @santhnm2 :: PR: [#2585]
- Synchronize total block count across pipeline parallel ranks by @santhnm2 :: PR: [#2578]
- Optimize TE CUDA Graph capturing time by @buptzyb :: PR: [#2482]
- Do a pass of typing fixes on transformer/ by @nschank :: PR: [#2766]
- moe: remove unused variable scale_up by @WineChord :: PR: [#1670]
- build: Pin down `nvidia-nvshmem-cu13` (#2798) by @ko3n1g :: PR: [#2803]
- DeepSeek V3 FSDP Fix for Precision-Aware Optimizer by @tomlifu :: PR: [#2466]
- Minor Fixes on Post-Training ModelOpt Examples by @ChenhanYu :: PR: [#2813]
- fix(moe): Support HybridEP and reduce memory overhead for 1F1B A2A overlap by @lhb8125 :: PR: [#2236]
- Inference memory test by @wdykas :: PR: [#2724]
- Move batch invariance mode init to initialize.py by @santhnm2 :: PR: [#2832]
- Move full model init to cuda stream to avoid race condition leading to empty parameters in DDP by @jstjohn :: PR: [#2652]
- [docs] Cleanup homepage by @Phlip79 :: PR: [#2823]
- [docs] Update oncall doc by @Phlip79 :: PR: [#2822]
- Make default for rerun_mode=disabled not terminate with non-fatal rer… by @kwyss-nvidia :: PR: [#2773]
- Bugfix: ensure spawned persistent checkpoint worker sets its CUDA device correctly for CUDA context creation / hypothetical memory allocations by @ankurv-nvidia :: PR: [#2710]
- Implementation of a more flexible optimizer/scheduler override system by @jstjohn :: PR: [#2723]
- ci(fix): PyPI upload by @ko3n1g :: PR: [#2843]
- ci(fix): Don't fail on empty var by @ko3n1g :: PR: [#2850]
- Add RL support for MOEs by @jon-barker :: PR: [#2742]
- ci(fix): GH release version tag by @ko3n1g :: PR: [#2854]
- Reduce the scope of the side stream around DDP initialization by @jstjohn :: PR: [#2852]
- Manually update first oncall rotation by @Phlip79 :: PR: [#2855]
- Remove flaky iteration time functional test by @buptzyb :: PR: [#2862]
- Nccl gloo refit for RL by @wdykas :: PR: [#2812]
- build: Bump jet-client by @ko3n1g :: PR: [#2876]
- Dynamic Inference | Evict and re-compute context requests. by @lmcafee-nvidia :: PR: [#2738]
- Change oncall team name by @Phlip79 :: PR: [#2861]
- Revert "Dynamic Inference | Evict and re-compute context requests. (#2738)" by @chtruong814 :: PR: [#2884]
- [main] feat(moe): Support moe shared expert gate for Qwen3-Next (2/4) by @yuzhongw-nvidia :: PR: [#2751]
- [main] feat(moe): Support attention output gate for Qwen3-Next (3/4) by @yuzhongw-nvidia :: PR: [#2752]
- [docs] Fix docs and add generation doc by @Phlip79 :: PR: [#2882]
- Fix CUDA RNG Tracker by @buptzyb :: PR: [#2641]
- FP8 params support for megatron-fsdp (MXFP8/Blockwise) by @kunlunl :: PR: [#2239]
- docs: fix broken images, links, and typos across documentation by @sbhavani :: PR: [#2794]
- ci(fix): Release version by @ko3n1g :: PR: [#2873]
- Assign mcore-oncall instead of user by @Phlip79 :: PR: [#2879]
- tests: Disable Mamba MOE model test after 43b4471 by @ko3n1g :: PR: [#2886]
- Fix mamba moe unit test after commit reversion by @jon-barker :: PR: [#2888]
- Fix inference server to make nemogym work. by @yobibyte :: PR: [#2887]
- Use DynamicInferenceCoordinator for text generation server by @santhnm2 :: PR: [#1910]
- Improve error messages in mamba moe unit test by @jon-barker :: PR: [#2889]
- [training migration] add RNG config dataclass by @maanug-nv :: PR: [#2347]
- [training migration] Add RerunStateMachineConfig dataclass by @maanug-nv :: PR: [#2436]
- Add retry loop with exponential backoff in dataloader as a form of in-application fault tolerance by @deepakn94 :: PR: [#2836]
- [training migration] Add SchedulerConfig dataclass by @maanug-nv :: PR: [#2400]
- RL: Fix cu_seqlens construction for PackedSeqParams by @mathemakitten :: PR: [#2883]
- [training migration] Add ProfilingConfig dataclass by @maanug-nv :: PR: [#2393]
- [MoE] Apply grouped gemm bias before unpadding for FP8 by @cuichenx :: PR: [#2817]
- Update Slack user group when oncall changes by @Phlip79 :: PR: [#2859]
- Remove unused FlashAttention3 args by @santhnm2 :: PR: [#2898]
- Use different token for assign logic by @Phlip79 :: PR: [#2893]
- chore: Add `--no-container-mount-home` to script by @ko3n1g :: PR: [#2906]
- build: Bump deps by @ko3n1g :: PR: [#2911]
- Fix RL sequence packing bin size by @tdene :: PR: [#2909]
- feat: m4 leftover changes by @yaoyu-33 :: PR: [#2506]
- Revert "Remove unused FlashAttention3 args (#2898)" by @chtruong814 :: PR: [#2916]
- ci: Skip broken tests after dependency bump by @chtruong814 :: PR: [#2934]
- Ko3n1g/build/downgrade flashinfer by @ko3n1g :: PR: [#2937]
- ci: Skip unit test cleanup by @chtruong814 :: PR: [#2940]
- build: 26.02 dependency bump main by @ko3n1g :: PR: [#2923]
- RL refit pipelining support by @wdykas :: PR: [#2878]
- [MAIN][NVFP4][MOE] 128 Zero Padding for Grouped Quantization kernels and Cuda Graph Support by @zhongbozhu :: PR: [#2655]
- Support DDP overlap for models with repeated parameters by @deepakn94 :: PR: [#2837]
- Add muon and layerwise distributed optimizer by @FDecaYed :: PR: [#2241]
- Revert "[dev] Add assertion for mxfp8 params without dp overlap (#2270)" by @ko3n1g :: PR: [#2901]
- Unit test for model_provider to model_builder coupling by @AAnoosheh :: PR: [#2925]
- ci: Onboard GB200 by @ko3n1g :: PR: [#2847]
- Install slack-sdk using uv by @Phlip79 :: PR: [#2948]
- Inference | Evict overflow paused requests from context. by @lmcafee-nvidia :: PR: [#2926]
- Enable training cudagraphs for RL by @mathemakitten :: PR: [#2452]
- feat(moe): Support placing MTP layers into standalone stages by @BestJuly :: PR: [#2136]
- Various fixes to in-job restarter and better time accounting of startup operations by @hexinw-nvidia :: PR: [#2698]
- Fix minor README wording and capitalization by @Deepak-J0shi :: PR: [#2928]
- ci: Restore grpo tests by @ko3n1g :: PR: [#2952]
- Fix GitHub GRPO resharding functional test by @tdene :: PR: [#2927]
- cp: `ci(fix): GB200 racecondition (2962)` into `main` by @ko3n1g :: PR: [#2963]
- Add out-of-SLA link by @Phlip79 :: PR: [#2903]
- feat(moe): Fine-grained activation offloading by @lhb8125 :: PR: [#1913]
- Fix broken mamba-moe unit test by @jon-barker :: PR: [#2970]
- ci: Fix GB200 change by @ko3n1g :: PR: [#2969]
- Update golden values for reshard test by @tdene :: PR: [#2971]
- chore: Update golden values by @ko3n1g :: PR: [#2973]
- Pass through --trust-remote-code and add this to all Nemotron model configs by @ChenhanYu :: PR: [#2939]
- Cuda 13 UVM by @wdykas :: PR: [#2957]
- Enable phase transition iterations by @jkamalu :: PR: [#2938]
- add missing import in rl_utils.py by @jon-barker :: PR: [#2915]
- Add sequence packing support for hybrid model by @duncanriach :: PR: [#2913]
- [Main] Partial CUDA Graph support for EP Overlap by @Wohox :: PR: [#2184]
- docs(megatron-fsdp): add Megatron-FSDP user guide by @xuwchen :: PR: [#2396]
- DeepSeek V3.2 support by @kunlunl :: PR: [#2440]
- fully remove zarr support by @dimapihtar :: PR: [#2944]
- chore: Standardize setuptools version by @ko3n1g :: PR: [#2975]
- ci: Run functional tests on main by @ko3n1g :: PR: [#2983]
- ci(fix): CI_COMMIT_BRANCH on forks by @ko3n1g :: PR: [#2982]
- [main] feat(moe): Support gated delta net for Qwen3-Next (1/4) by @yuzhongw-nvidia :: PR: [#1989]
- ci: Add more gb200 nightly tests by @ko3n1g :: PR: [#2981]
- [main] feat(moe): Support apply wd to qk layernorm for Qwen3-Next (4/4) by @yuzhongw-nvidia :: PR: [#2753]
- Re-submit "Various fixes to in-job restarter and better time accounting of startup operations" by @hexinw-nvidia :: PR: [#2954]
- Use slack-sdk in a different manner by @Phlip79 :: PR: [#2950]
- Hybrid Context Parallel Feature by @parthmannan :: PR: [#2282]
- Inference | Move `assert active_request_count > 0`. by @lmcafee-nvidia :: PR: [#2958]
- Set `token_dtype_code` init value in `GPTDatasetConfig` to fix CI by @asolergi-nv :: PR: [#2912]
- [main] ci(moe): Add `--apply-wd-to-qk-layernorm` flag to the gdn test case by @yuzhongw-nvidia :: PR: [#2995]
- ci: Disable step time on `gpt3_moe_mcore_te_tp2_pp2_ep4_etp1_no_mtp_n… by @ko3n1g :: PR: [#2991]
- ci: Fix workflows on main by @ko3n1g :: PR: [#2990]
- Make Megatron-FSDP torch.compile compatible by @shjwudp :: PR: [#2425]
- [Megatron-FSDP] Test FP8 activations + parameter sharding with Megatron-FSDP fully-shard. Update README. by @cspades :: PR: [#2894]
- chore: Escape special chars by @ko3n1g :: PR: [#3014]
- Improve memory logging by @deepakn94 :: PR: [#2839]
- Add a wrapper function for FA3 _flash_attn_forward call by @santhnm2 :: PR: [#2933]
- chore: Set umask 0002 by @ko3n1g :: PR: [#3027]
- Make attn mask inversion in-place instead of allocating it again by @mathemakitten :: PR: [#3019]
- [Megatron-FSDP] Fix incorrect gradient scaling target. by @cspades :: PR: [#3023]
- Replaces ModuleSpec with Protocols for some of the inputs to SelfAttention/CrossAttention by @nschank :: PR: [#2761]
- Various CUDA graph improvements on capture time, replay time, memory footprint by @jiemingz :: PR: [#2572]
- Update oncall schedule by @Phlip79 :: PR: [#3017]
- Ensure that last prefill chunk is handled correctly by Mamba models by @santhnm2 :: PR: [#2897]
- Add script for batch running CI tests across distinct nodes by @jon-barker :: PR: [#3047]
- Refit EP support by @wdykas :: PR: [#2972]
- Catch case of negative tokens to generate by @tdene :: PR: [#2985]
- Sync GitHub and Slack teams by @Phlip79 :: PR: [#3037]
- ci: Remove Github transition comment from CI by @chtruong814 :: PR: [#2881]
- Support custom Router implementations in MoELayer by @nschank :: PR: [#2891]
- ci: Override N_REPEAT by @ko3n1g :: PR: [#3051]
- Update type hints and doc strings for moe_utils.py by @JavaZeroo :: PR: [#2821]
- Supporting inference when called within an asyncio loop by @shanmugamr1992 :: PR: [#2816]
- Remove calculation of padding token in moe routing loss by @HaochenYuan :: PR: [#2142]
- Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: [#3049]
- Revert "Bug fix with --no-use-tokenizer-from-checkpoint-args (#3049)" by @thomasdhc :: PR: [#3057]
- Add health endpoint to dynamic text gen server by @santhnm2 :: PR: [#3009]
- ci: Skip test_precision_aware_optimizer by @thomasdhc :: PR: [#3062]
- Support multimodule communication by @yaoyu-33 :: PR: [#2031]
- Revert "Support multimodule communication (#2031)" by @ko3n1g :: PR: [#3068]
- Revert "Remove calculation of padding token in moe routing loss (#2142)" by @ko3n1g :: PR: [#3069]
- Add ability to save wgrads and dgrads by @deepakn94 :: PR: [#3032]
- ci: Mark test_mode_partial_cudagraph unit tests as flaky by @chtruong814 :: PR: [#3064]
- Keep FSDP's and DDP's finish_grad_sync API identical by @deepakn94 :: PR: [#3070]
- (REPLAY) Bug fix with --no-use-tokenizer-from-checkpoint-args by @jon-barker :: PR: [#3059]
- Optimizing post-processing of requests by @sidsingh-nvidia :: PR: [#2920]
- Fix broken functional tests in [#2920] by @sidsingh-nvidia :: PR: [#3071]
- fix ep weight gradnorm/num_zero calculation error for muon by @FDecaYed :: PR: [#3024]
- [training migration] Add LoggerConfig dataclass by @maanug-nv :: PR: [#2414]
- Added --ft-num-warmup-iters option. by @hexinw-nvidia :: PR: [#3052]
- Reapply "Various CUDA graph improvements on capture time, replay time, memory footprint (#2572)" by @jiemingz :: PR: [#3056]
- fix(fsdp): add CLI argument for outer_dp_sharding_strategy by @liuyun7345 :: PR: [#3053]
- ci: Log node name by @ko3n1g :: PR: [#3081]
- docs: Release docs by @ko3n1g :: PR: [#3055]
- Support NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 FP8/NVFP4 PTQ in example by @ChenhanYu :: PR: [#3079]
- add all_gather process-group for overlapping in fsdp disributed training by @jeffnvidia :: PR: [#2663]
- Add router replay for MoE models by @litianjian :: PR: [#2101]
- ci: Disable gpt_dynamic_inference_tp1_pp1_dp8_583m_throughputtest_zmq by @ko3n1g :: PR: [#3099]
- ci: Repeat func tests, save logs of unit tests and lessen debug output by @ko3n1g :: PR: [#3089]
- ci: Update improvement of step-time by @ko3n1g :: PR: [#3104]
- ci: Add GPU health checks by @ko3n1g :: PR: [#3100]
- Harden GRPO functional tests by @jon-barker :: PR: [#3065]
- build: Bump to TE2.12 by @ko3n1g :: PR: [#3086]
- Inference functional tests: Write outputs to INFERENCE_OUTPUT_PATH instead of TENSORBOARD_PATH by @mathemakitten :: PR: [#3061]
- Update moe readme.
by @Victarry :: PR: [#2830] - Logging cleanup (only log on rank 0 if possible) by @deepakn94 :: PR: [#3036] - Move all bert and t5 tests to nightly by @Phlip79 :: PR: [#3106] - Create greptile.json by @Phlip79 :: PR: [#3087] - Fix bug of reuse_grad_buf_for_mxfp8_param_ag by @kunlunl :: PR: [#2802] - Fix for Hybrid CP by @parthmannan :: PR: [#3091] - Fix GRPO re-fit functional test by @jon-barker :: PR: [#3113] - Minimize README contents by @megnvidia :: PR: [#3020] - Add end-to-end tests for M-FSDP and ND-Parallel by @shjwudp :: PR: [#3031] - [M-FSDP] Fix double buffering not working with activation recompute by @shjwudp :: PR: [#2689] - Fix Multimodal Dockerfile by @faradawn :: PR: [#3006] - [training migration] Add CheckpointConfig dataclass by @maanug-nv :: PR: [#2431] - [training migration] Add StragglerDetectionConfig dataclass by @maanug-nv :: PR: [#2435] - Standardize RL unit tests by @tdene :: PR: [#3088] - Use the latest hybrid-ep by @Autumn1998 :: PR: [#3093] - remove retro by @dimapihtar :: PR: [#3001] - ci: Mark test_compatible_with_nd_parallel as flaky by @ko3n1g :: PR: [#3122] - build: Use merge-commit-sha for container by @ko3n1g :: PR: [#3123] - Refactor `rl_offload_kv_cache_during_training` to offload KV cache to CPU while retaining fixed virtual address by @mathemakitten :: PR: [#3048] - Disable Greptile status comments by @Phlip79 :: PR: [#3127] - ci: Add unit tests to merge queue by @ko3n1g :: PR: [#3125] - Create CodeRabbit config by @Phlip79 :: PR: [#3131] - build: Explicitly set minimum torch version to >= 2.6.0 by @chtruong814 :: PR: [#3085] - Move kitchen extension file to private kitchen repository by @kwyss-nvidia :: PR: [#2779] - Revert "Fix RL optimizer offload (#3112)" by @ko3n1g :: PR: [#3141] - Revise and move KD docs by @AAnoosheh :: PR: [#3108] - build: Bump FLA by @ko3n1g :: PR: [#3139] - ci: Add job timeouts by @ko3n1g :: PR: [#3142] - Multiturn rollout support prep by @yobibyte :: PR: [#2966] - ci: Set NODE_RANK by @ko3n1g :: 
PR: [#3143] - Reapply [3955c4] by @jon-barker :: PR: [#3146] - Revert "Multiturn rollout support prep (#2966)" by @ko3n1g :: PR: [#3153] - Fix coderabbit instructions error by @Phlip79 :: PR: [#3150] - Force input ids generated by mock dataset are < vocab_size by @asolergi-nv :: PR: [#2945] - Add a check to make sure we are distributing all the layers when using `--decoder-first-pipeline-num-layers` & `--decoder-last-pipeline-num-layers` by @asolergi-nv :: PR: [#2947] - Automatically choose available ports in ZMQ by @tdene :: PR: [#2278] - Generate arguments from TransformerConfig by @maanug-nv :: PR: [#2896] - Fix for PR-2142 by @HaochenYuan :: PR: [#3165] - ci: Onboard more GB200 tests by @ko3n1g :: PR: [#3145] - ci(hotfix): Alert for GB200 by @ko3n1g :: PR: [#3168] - Fix SFTDataset truncation bug by @duncanriach :: PR: [#3158] - Vitalyk/multiturn v2 by @yobibyte :: PR: [#3167] - ci: Disable the api check for now by @chtruong814 :: PR: [#3157] - ci: Add DSv3 proxy by @ko3n1g :: PR: [#3169] - Nvshmem refit by @wdykas :: PR: [#2696] - [Community][Main] fix(moe): Fix theoretical memory calculation of layernorm. 
by @1195343015 :: PR: [#2434] - fix: Set --refit-method default to gloo by @wdykas :: PR: [#3172] - [fix] Bug fix for offloading in evaluate() by @lhb8125 :: PR: [#3043] - cp: `Fix: nccl-ub in ddp path (3181)` into `main` by @ko3n1g :: PR: [#3182] - Miscellaneous inference cleanup by @santhnm2 :: PR: [#2955] - ci: Fix DSv3 by @ko3n1g :: PR: [#3188] - Fix missing argument in MoELayer.forward() by @jiemingz :: PR: [#3133] - Fix H2D stream synchronization in optimizer offload by @tgkyrie :: PR: [#3140] - Add MTP support for hybrid models by @rkarimimahab :: PR: [#2363] - docs: improve Megatron-LM and Megatron Core descriptions by @sbhavani :: PR: [#3115] - Handle `step` key correctly in checkpoint save with `--optimizer-cpu-offload` by @ahmadki :: PR: [#2874] - cp: `ci: Checkpoint retention (3205)` into `core_r0.16.0` by @ko3n1g :: PR: [#3222] - cp: `Fix uv install for GH actions (#3259)` by @ko3n1g :: PR: [#3275] - cp: `Fix missing PackedSeqParams import (3214)` into `core_r0.16.0` by @ko3n1g :: PR: [#3236] - cp: `fix: numpy overflow (3306)` into `core_r0.16.0` by @ko3n1g :: PR: [#3328] - Missing import fix (#3241) by @parthmannan :: PR: [#3298] - cp: `fix: T5 dataset (#3307)` by @ko3n1g :: PR: [#3329] - cp: `build: Bump TE on 2.12` by @ko3n1g :: PR: [#3372] - cp: `Improved parallel logging of learning rate` by @ko3n1g :: PR: [#3367] - cp: `ci: Update release-docs workflow to use FW-CI-templates v0.72.0 (3438)` into `core_r0.16.0` by @ko3n1g :: PR: [#3454] - cp: `ci: Remove environments (3462)` into `core_r0.16.0` by @ko3n1g :: PR: [#3481] - cp: Update release workflow to include changelog and publish docs (#3472) by @chtruong814 :: PR: [#3480] - chore(beep boop πŸ€–): Bump `uv.lock` (core_r0.16.0) (2026-02-19) by @svcnvidia-nemo-ci :: PR: [#3502] - docs: Update docs for 0.16.0 by @chtruong814 :: PR: [#3505] - chore(beep boop πŸ€–): Bump `uv.lock` (core_r0.16.0) (2026-02-23) by @svcnvidia-nemo-ci :: PR: [#3533] - docs: Update docs version picker for 0.16.0 to include 
nightly by @chtruong814 :: PR: [#3547] - cp: `ci: Test docs build (#3583)` by @ko3n1g :: PR: [#3593] - cp: Changes of CICD workflow by @ko3n1g :: PR: [#3603]
Source: README.md, updated 2026-02-26