Amazon EC2 Capacity Blocks for ML
Amazon EC2 Capacity Blocks for ML enable you to reserve accelerated compute instances in Amazon EC2 UltraClusters for your machine learning workloads. This service supports Amazon EC2 P5en, P5e, P5, and P4d instances, powered by NVIDIA H200, H100, and A100 Tensor Core GPUs, respectively, as well as Trn2 and Trn1 instances powered by AWS Trainium. You can reserve these instances for up to six months in cluster sizes ranging from one to 64 instances (512 GPUs or 1,024 Trainium chips), providing flexibility for various ML workloads. Reservations can be made up to eight weeks in advance. By colocating in Amazon EC2 UltraClusters, Capacity Blocks offer low-latency, high-throughput network connectivity, facilitating efficient distributed training. This setup ensures predictable access to high-performance computing resources, allowing you to plan ML development confidently, run experiments, build prototypes, and accommodate future surges in demand for ML applications.
Learn more
AWS Elastic Fabric Adapter (EFA)
Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. Its custom-built operating system (OS) bypass hardware interface enhances the performance of inter-instance communications, which is critical to scaling these applications. With EFA, High-Performance Computing (HPC) applications using the Message Passing Interface (MPI) and Machine Learning (ML) applications using NVIDIA Collective Communications Library (NCCL) can scale to thousands of CPUs or GPUs. As a result, you get the application performance of on-premises HPC clusters with the on-demand elasticity and flexibility of the AWS cloud. EFA is available as an optional EC2 networking feature that you can enable on any supported EC2 instance at no additional cost. Plus, it works with the most commonly used interfaces, APIs, and libraries for inter-node communications.
Learn more
Amazon EC2 P4 Instances
Amazon EC2 P4d instances deliver high performance for machine learning training and high-performance computing applications in the cloud. Powered by NVIDIA A100 Tensor Core GPUs, they offer industry-leading throughput and low-latency networking, supporting 400 Gbps instance networking. P4d instances provide up to 60% lower cost to train ML models, with an average of 2.5x better performance for deep learning models compared to previous-generation P3 and P3dn instances. Deployed in hyperscale clusters called Amazon EC2 UltraClusters, P4d instances combine high-performance computing, networking, and storage, enabling users to scale from a few to thousands of NVIDIA A100 GPUs based on project needs. Researchers, data scientists, and developers can utilize P4d instances to train ML models for use cases such as natural language processing, object detection and classification, and recommendation engines, as well as to run HPC applications like pharmaceutical discovery and more.
Learn more
CoreWeave
CoreWeave is a cloud infrastructure provider specializing in GPU-based compute solutions tailored for AI workloads. The platform offers scalable, high-performance GPU clusters that optimize the training and inference of AI models, making it ideal for industries like machine learning, visual effects (VFX), and high-performance computing (HPC). CoreWeave provides flexible storage, networking, and managed services to support AI-driven businesses, with a focus on reliability, cost efficiency, and enterprise-grade security. The platform is used by AI labs, research organizations, and businesses to accelerate their AI innovations.
Learn more