FairScale
PyTorch extensions for high performance and large scale training
...It introduced Fully Sharded Data Parallel (FSDP)-style techniques that shard model parameters, gradients, and optimizer state across ranks so that larger models fit in the same memory budget. The library also provides pipeline parallelism, activation checkpointing, mixed precision, optimizer state sharding (OSS), and auto-wrapping policies that cut boilerplate in complex distributed setups. Its components are modular, so teams can adopt just the sharding optimizer or the pipeline engine without rewriting their training loop. FairScale emphasizes correctness and debuggability, offering hook points, logging, and reference examples for common trainer patterns. ...
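A minimal sketch of the sharding and auto-wrapping pattern described above, assuming a process group has already been initialized (e.g. via `torchrun`); `MyModel`, the layer sizes, and the chosen FSDP keyword arguments are illustrative, not defaults:

```python
import torch.nn as nn
from fairscale.nn import FullyShardedDataParallel as FSDP
from fairscale.nn import checkpoint_wrapper
from fairscale.nn.wrap import enable_wrap, auto_wrap

# Toy model; checkpoint_wrapper adds activation checkpointing per block,
# recomputing activations in the backward pass to save memory.
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            *[checkpoint_wrapper(nn.Linear(1024, 1024)) for _ in range(8)]
        )

    def forward(self, x):
        return self.blocks(x)

# Assumes torch.distributed.init_process_group(...) has already run.
# enable_wrap records the wrapper class and kwargs for the enclosed region;
# auto_wrap then recursively wraps submodules that match its size policy,
# so each rank only materializes its shard of parameters, gradients, and
# optimizer state.
fsdp_params = dict(mixed_precision=True, flatten_parameters=True)
with enable_wrap(wrapper_cls=FSDP, **fsdp_params):
    model = auto_wrap(MyModel())
model = FSDP(model, **fsdp_params)  # also shard the root module
```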
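The optimizer state sharding (OSS) piece can be adopted on its own: wrap any torch optimizer in `OSS` and, optionally, pair it with `ShardedDataParallel` so each gradient is reduced directly to the rank that owns its slice of the optimizer state. A sketch under the same assumption of an initialized process group; the model and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as SDP

# Stand-in for an existing model and training loop.
model = nn.Linear(1024, 1024).cuda()

# OSS partitions the wrapped optimizer's state across ranks; each rank
# keeps only the state for its own parameter shard.
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD,
                lr=0.1, momentum=0.9)
model = SDP(model, optimizer)

# One ordinary training step; the loop itself does not change.
loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```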
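Pipeline parallelism is similarly standalone. A sketch that splits a small `nn.Sequential` across two GPUs with `fairscale.nn.Pipe`; the balance and chunk count are illustrative values, and the exact `Pipe` signature has varied across FairScale releases:

```python
import torch
import torch.nn as nn
from fairscale.nn import Pipe

# Pipe expects an nn.Sequential so it can cut the model into stages.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

# balance puts two layers on each device; chunks splits every mini-batch
# into micro-batches that flow through the stages concurrently.
model = Pipe(model, balance=[2, 2], devices=[0, 1], chunks=4)

# Input lives on the first stage's device; output comes back on the last.
out = model(torch.randn(32, 1024, device="cuda:0"))
```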