Amazon EC2 UltraClusters
Amazon EC2 UltraClusters let you scale to thousands of GPUs or purpose-built machine learning accelerators, such as AWS Trainium, and give you on-demand access to supercomputing-class performance. They make supercomputing available to ML, generative AI, and high-performance computing developers through a pay-as-you-go model with no setup or maintenance costs. An UltraCluster consists of thousands of accelerated EC2 instances co-located in a single AWS Availability Zone and interconnected with Elastic Fabric Adapter (EFA) networking in a petabit-scale nonblocking network. The architecture also provides access to Amazon FSx for Lustre, a fully managed shared storage service built on a high-performance parallel file system, so massive datasets can be processed with sub-millisecond latencies. EC2 UltraClusters provide scale-out capabilities for distributed ML training and tightly coupled HPC workloads, which reduces training times.
AWS HPC
AWS High Performance Computing (HPC) services empower users to execute large-scale simulations and deep learning workloads in the cloud, providing virtually unlimited compute capacity, high-performance file systems, and high-throughput networking. This suite of services accelerates innovation by offering a broad range of cloud-based tools, including machine learning and analytics, enabling rapid design and testing of new products. Operational efficiency is maximized through on-demand access to compute resources, allowing users to focus on complex problem-solving without the constraints of traditional infrastructure. AWS HPC solutions include Elastic Fabric Adapter (EFA) for low-latency, high-bandwidth networking, AWS Batch for scaling computing jobs, AWS ParallelCluster for simplified cluster deployment, and Amazon FSx for high-performance file systems. These services collectively provide a flexible and scalable environment tailored to diverse HPC workloads.
AWS ParallelCluster
AWS ParallelCluster is an open-source cluster management tool that simplifies the deployment and management of high performance computing (HPC) clusters on AWS. It automates the setup of the required resources, including compute nodes, a shared file system, and a job scheduler, and it supports multiple instance types and job submission queues. You can work with ParallelCluster through a graphical user interface, a command-line interface, or an API. The tool integrates with job schedulers such as AWS Batch and Slurm, so existing HPC workloads can be migrated to the cloud with minimal modifications. With AWS ParallelCluster, you use a simple text file to model, provision, and dynamically scale the resources your applications need in an automated and secure manner. AWS ParallelCluster is available at no additional charge; you pay only for the AWS resources your applications consume.
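As a rough illustration of the "simple text file" idea, the sketch below builds a cluster description as a Python dictionary and prints it as JSON. The key names follow the ParallelCluster 3 configuration layout as best understood here and should be checked against the current documentation. The region, instance types, queue name, storage size, subnet ID, and key pair name are all placeholder assumptions, and `build_cluster_config` is a hypothetical helper, not part of any AWS tool.

```python
import json


def build_cluster_config(subnet_id, key_name, max_nodes=16):
    """Return a dict shaped like a ParallelCluster 3 cluster configuration.

    All concrete values (region, instance types, names, sizes) are
    illustrative assumptions, not recommendations.
    """
    return {
        "Region": "us-east-1",
        "Image": {"Os": "alinux2"},
        "HeadNode": {
            "InstanceType": "c5.xlarge",
            "Networking": {"SubnetId": subnet_id},
            "Ssh": {"KeyName": key_name},
        },
        "Scheduling": {
            "Scheduler": "slurm",
            "SlurmQueues": [
                {
                    "Name": "compute",
                    "ComputeResources": [
                        {
                            "Name": "c5n18",
                            "InstanceType": "c5n.18xlarge",
                            # MinCount 0 lets the queue scale down to no
                            # compute nodes when it is idle.
                            "MinCount": 0,
                            "MaxCount": max_nodes,
                            "Efa": {"Enabled": True},
                        }
                    ],
                    "Networking": {
                        "SubnetIds": [subnet_id],
                        "PlacementGroup": {"Enabled": True},
                    },
                }
            ],
        },
        "SharedStorage": [
            {
                "MountDir": "/fsx",
                "Name": "scratch",
                "StorageType": "FsxLustre",
                "FsxLustreSettings": {"StorageCapacity": 1200},
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder identifiers; substitute values from your own account.
    config = build_cluster_config("subnet-0123456789abcdef0", "my-key")
    print(json.dumps(config, indent=2))
```

The actual configuration file is normally written in YAML. Because JSON is largely a subset of YAML, the printed output can serve as a starting point, but validate it with the ParallelCluster tooling before relying on it.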
Rocky Linux
CIQ empowers people to do amazing things by providing innovative and stable software infrastructure solutions for all computing needs. From the base operating system, through containers, orchestration, provisioning, computing, and cloud applications, CIQ works with every part of the technology stack to drive solutions for customers and communities with stable, scalable, secure production environments. CIQ is the founding support and services partner of Rocky Linux, and the creator of the next generation federated computing stack.
- Rocky Linux: open, secure enterprise Linux
- Apptainer: application containers for high performance computing
- Warewulf: cluster management and operating system provisioning
- HPC2.0: the next generation of high performance computing, a cloud-native federated computing platform
- Traditional HPC: turnkey computing stack for traditional HPC