AI Superpod Clusters

Ultra-massive GPU arrays.

Pre-built clusters of hundreds to thousands of interconnected NVIDIA H100/H200 GPUs. InfiniBand fabric, parallel storage, and job scheduling — ready to train the largest models.

[Diagram: Slurm / Kubernetes control plane scheduling training jobs over 400G InfiniBand across NVIDIA HGX H200 nodes linked by 900 GB/s NVLink; scales to 16,384+ GPUs with zero throttling]

Scale: up to 16,384 GPUs

InfiniBand: 400 Gbps

Ready for 1T+ parameter models

Uptime SLA: 99.9%

Machine families

Purpose-built configurations for every workload profile — from web serving to GPU-accelerated ML training.

SP-H200 / SP-H100

Superpod Configurations

Pre-configured GPU superpods with optimized network topology, parallel storage, and job scheduling for training the world's largest AI models.

Foundation model training · Multi-modal models · Scientific AI · Sovereign AI
View all configurations

GPU: H200 / H100

Max GPUs: 16,384

InfiniBand: fat-tree NDR

Storage: DAOS / Lustre

Infrastructure for frontier AI.

Turn-key GPU superpods with optimized networking, storage, and scheduling.

Up to 16,384 GPUs

Pre-built superpod clusters from 256 to 16,384 GPUs with optimized fat-tree InfiniBand topology for all-to-all communication.

InfiniBand fat-tree fabric

400 Gbps NDR InfiniBand with non-blocking fat-tree topology. Sub-microsecond MPI latency across the entire cluster.
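To get a feel for what 400 Gbps per link means for collective communication, here is a back-of-envelope sketch using the standard ring all-reduce cost model (the 400 Gbps figure is from this page; the model and message size are illustrative assumptions, and real runs add latency and protocol overhead):

```python
# Lower-bound all-reduce time on a 400 Gbps fabric, using the standard
# ring all-reduce model: bytes on the wire = 2 * (n - 1) / n * message size.

def allreduce_seconds(message_bytes: float, n_gpus: int, link_gbps: float = 400.0) -> float:
    """Bandwidth-only lower bound for one ring all-reduce (latency ignored)."""
    link_bytes_per_s = link_gbps * 1e9 / 8            # 400 Gbps -> 50 GB/s
    wire_bytes = 2 * (n_gpus - 1) / n_gpus * message_bytes
    return wire_bytes / link_bytes_per_s

# Example: all-reducing 1 GiB of gradients across 1,024 GPUs
t = allreduce_seconds(2**30, 1024)
print(f"{t * 1000:.1f} ms")   # -> 42.9 ms
```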

Parallel storage system

DAOS or Lustre file system delivering 1+ TB/s aggregate read throughput for checkpoint I/O and training data.
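To see why 1+ TB/s matters for checkpointing, here is a rough size-and-time estimate (my own arithmetic, not vendor figures; it assumes bf16 weights at 2 bytes/parameter and fp32 Adam state at 12 extra bytes/parameter):

```python
# Rough checkpoint-size and write-time estimate for a 1T-parameter model.
# Assumptions: bf16 weights (2 bytes/param); Adam adds fp32 master weights,
# momentum, and variance (12 bytes/param); 1 TB/s aggregate FS throughput.

def checkpoint_tb(params: float, bytes_per_param: float) -> float:
    """Checkpoint size in terabytes."""
    return params * bytes_per_param / 1e12

params = 1e12                                 # 1T parameters
weights_tb = checkpoint_tb(params, 2)         # bf16 weights only
full_tb    = checkpoint_tb(params, 2 + 12)    # weights + Adam state

throughput_tb_s = 1.0                         # 1 TB/s aggregate throughput
print(f"weights: {weights_tb:.0f} TB -> {weights_tb / throughput_tb_s:.0f} s")
print(f"full:    {full_tb:.0f} TB -> {full_tb / throughput_tb_s:.0f} s")
```

At 1 TB/s, even a full 14 TB optimizer checkpoint drains in seconds rather than minutes, which is what makes frequent checkpointing practical at this scale.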

Training frameworks

Pre-tuned Megatron-LM, DeepSpeed, and FSDP configurations. NCCL topology files generated for your cluster.
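A generated topology file typically plugs in through NCCL's environment variables. A minimal sketch (the file path and HCA device list below are hypothetical placeholders; `NCCL_TOPO_FILE` and `NCCL_IB_HCA` themselves are standard NCCL knobs):

```python
import os

# Point NCCL at a cluster-specific topology file and the InfiniBand HCAs.
# The path and device names are placeholders for illustration only.
nccl_env = {
    "NCCL_TOPO_FILE": "/etc/nccl/superpod-topo.xml",   # hypothetical path
    "NCCL_IB_HCA": "mlx5_0,mlx5_1,mlx5_2,mlx5_3",      # hypothetical HCA list
}
os.environ.update(nccl_env)
print(os.environ["NCCL_IB_HCA"])
```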

Automatic health monitoring

GPU health checks, InfiniBand link monitoring, and automatic replacement of nodes with failed GPUs.

Job scheduling

Managed Slurm scheduler with priority queues, job preemption, and multi-tenant access control.

Getting started

Launch your first instance in three steps. CLI, console, or API — your choice.

Terminal
ur superpod request \
  --config=sp-h200-256 \
  --duration=3months \
  --storage=daos-500tb

Train the largest models.

Foundation models, multi-modal AI, and sovereign AI infrastructure.

Train foundation models

Train GPT-, Llama-, and Mixtral-class models from scratch on 256- to 4,096-GPU clusters with optimized 3D parallelism.
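In 3D parallelism, the tensor-, pipeline-, and data-parallel degrees must multiply to the GPU count. A sketch of that bookkeeping (the degrees below are example choices for a hypothetical 1,024-GPU run, not a prescribed configuration):

```python
# 3D parallelism: world_size = tensor_parallel * pipeline_parallel * data_parallel.
# Example degrees for a hypothetical 70B-class run on 1,024 GPUs.

def data_parallel_degree(world_size: int, tp: int, pp: int) -> int:
    """Derive the data-parallel degree from the world size and TP/PP choices."""
    assert world_size % (tp * pp) == 0, "TP*PP must divide the GPU count"
    return world_size // (tp * pp)

world = 1024
tp, pp = 8, 4    # tensor-parallel within a node, pipeline across nodes
dp = data_parallel_degree(world, tp, pp)
print(tp, pp, dp, tp * pp * dp)   # -> 8 4 32 1024
```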

View tutorial

Suggested configuration

SP-H200-1024 · Megatron-LM · DAOS

Estimate your costs

Create detailed configurations to see exactly how much your architecture will cost. Pay for what you use, down to the second.

Configuration 1 · Estimated: $830.72/mo

Platform & Architecture

Compute Resources


Storage


Cost Optimization

Spot GPU Cluster — save up to 70% (may be reclaimed)
Config 1 cost: $830.72

Cost details


InfiniBand networking included. Volume discounts for 100+ GPUs.

Configuration 1: $830.72
8 vCPU × 80 GB compute: $630.72
Persistent storage: $200.00
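The line items sum as follows (a sketch checking the example estimate; the 70% spot figure is this page's own "up to" number, and applying it to compute only is an assumption — actual discounts vary):

```python
# Verify the example monthly estimate from the cost breakdown above.
compute = 630.72      # 8 vCPU x 80 GB compute, per month
storage = 200.00      # persistent storage, per month
total = compute + storage
print(f"${total:.2f}/mo")   # -> $830.72/mo

# Hypothetical spot pricing: assume the 70% discount applies to compute only.
spot_total = compute * (1 - 0.70) + storage
print(f"${spot_total:.2f}/mo")
```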

Works seamlessly with

GPU Instances
Parallel Storage
Model Registry
Cloud Monitoring
IAM
Cloud Logging


Train at any scale.

Pre-built GPU superpods from 256 to 16,384 GPUs. Request access today.