Fairseq
Facebook AI Research Sequence-to-Sequence Toolkit written in Python
...We provide reference implementations of various sequence modeling papers. Recent work by Microsoft and Google has shown that data parallel training can be made significantly more efficient by sharding the model parameters and optimizer state across data parallel workers. These ideas are encapsulated in the new FullyShardedDataParallel (FSDP) wrapper provided by fairscale. Fairseq can be extended through user-supplied plug-ins. Models define the neural network architecture and encapsulate all of the learnable parameters. Criterions compute the loss function given the model outputs and targets. ...