VaultLayer routes ML training across cloud providers and auto-resumes from R2 checkpoints when an instance dies. Zero changes to your training code.
When your instance dies — Vast.ai bid lost, GCP spot preempted, RunPod community pod reclaimed — VaultLayer's broker detects the missed heartbeat, fences the dead instance, and resumes from your last R2 checkpoint on a fresh instance. Same job ID, same training state.
VAULTLAYER'S CORE MOAT

Your model state writes directly to your Cloudflare R2 bucket every N steps. Job-scoped credentials prevent cross-job reads. When training resumes anywhere, on the same provider or a different one, it picks up at the exact step it left off.
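R2 speaks the S3 API, so the checkpoint write path can be pictured with plain boto3. This is an illustrative sketch, not VaultLayer's actual code: the env var names (`R2_ENDPOINT`, `R2_JOB_KEY_ID`, `R2_JOB_SECRET`, `R2_BUCKET`) and the key layout are assumptions.

```python
import os

def checkpoint_key(job_id: str, step: int, filename: str) -> str:
    # Keys are namespaced per job, so a job-scoped credential whose policy
    # only covers `jobs/<job_id>/*` cannot read another job's state.
    return f"jobs/{job_id}/checkpoints/step-{step:08d}/{filename}"

def upload_checkpoint(local_path: str, job_id: str, step: int) -> None:
    # R2 is S3-compatible: point boto3 at the account's R2 endpoint.
    # The broker would inject the job-scoped credential as env vars.
    import boto3
    r2 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],
        aws_access_key_id=os.environ["R2_JOB_KEY_ID"],
        aws_secret_access_key=os.environ["R2_JOB_SECRET"],
    )
    key = checkpoint_key(job_id, step, os.path.basename(local_path))
    r2.upload_file(local_path, os.environ["R2_BUCKET"], key)
```

Because the key is derived from the job ID, a resumed job on any provider can list its own prefix, find the highest saved step, and pull that checkpoint down.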
CROSS-PROVIDER PORTABLE

The same workload on Vast.ai is 28× cheaper than on Lambda Labs (real numbers from our test matrix). VaultLayer routes to the cheapest validated GPU available, and the reliability layer means cheap-but-flaky becomes cheap-and-reliable.
SAVINGS WITHOUT THE TAX

```shell
# 1. one-time setup
pip install vaultlayer
vaultlayer init

# 2. run any training script
vaultlayer run --gpu A100_40 python train.py
```
Wraps any PyTorch / JAX / HuggingFace script. No SDK to import. No magic decorators. Your script doesn't know it's being managed.
The dispatcher considers an allowlist of validated providers (today: Vast.ai, RunPod, Lambda Labs) and picks by price + capacity. Your job lands within seconds.
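A toy version of that price-and-capacity routing, using the provider names from the allowlist above. The `Offer` shape and the cheapest-first rule are assumptions for illustration, not the dispatcher's real internals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Offer:
    provider: str        # marketplace the offer came from
    gpu: str             # GPU type, e.g. "A100_40"
    usd_per_hour: float  # current spot/bid price
    available: int       # instances with free capacity

# Today's validated allowlist, per the text above.
ALLOWLIST = {"vast.ai", "runpod", "lambda-labs"}

def pick_offer(offers: list, gpu: str) -> Optional[Offer]:
    # Filter to validated providers with the requested GPU and real capacity,
    # then take the cheapest remaining offer.
    candidates = [
        o for o in offers
        if o.provider in ALLOWLIST and o.gpu == gpu and o.available > 0
    ]
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)
```

With live price feeds plugged in, the same filter-then-min shape is enough to land a job on the cheapest validated GPU.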
Standard HuggingFace save_steps writes flow to your Cloudflare R2 bucket on a job-scoped credential. Master keys never leave the broker.
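Concretely, the only checkpoint knobs live in your own script. A standard HuggingFace configuration like the following is enough; nothing here is VaultLayer-specific, and the values are examples:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",  # local save dir; writes here flow to R2 per the doc
    save_steps=50,             # "every N steps" -- N is your choice
    save_total_limit=2,        # bound local disk usage on the instance
)
```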
Heartbeat watchdog detects the missing instance. Broker re-provisions on the same or next-best provider. New instance reads the R2 checkpoint and continues from the last saved step.
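The watchdog side reduces to a timestamp comparison. A minimal sketch, assuming a 90-second timeout and an in-memory heartbeat map; both are illustrative choices, not VaultLayer internals:

```python
HEARTBEAT_TIMEOUT_S = 90.0  # illustrative threshold

def find_dead_instances(last_beat: dict, now: float) -> list:
    """Return instance IDs whose last heartbeat is older than the timeout.

    `last_beat` maps instance ID -> unix timestamp of its latest heartbeat.
    The broker would fence each returned instance (revoke its credential)
    before re-provisioning, so a zombie can't keep writing checkpoints.
    """
    return [iid for iid, t in last_beat.items() if now - t > HEARTBEAT_TIMEOUT_S]
```

Fencing before resume is the important ordering: revoke the dead instance's job-scoped credential first, then let the replacement read the last checkpoint.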
| Provider | GPU | Pure training | Wall-clock | Cost |
|---|---|---|---|---|
| Vast.ai (marketplace) | A100-PCIE-40GB | 7.2 min | 10.9 min | $0.017 |
| RunPod (community pods) | RTX 4090 (24 GB) | 8.1 min | 11.1 min | $0.128 |
| Lambda Labs (on-demand) | A100-SXM4-40GB | 6.5 min | 14.3 min | $0.475 |
| vs AWS p4d.24xlarge (on-demand baseline) | A100-SXM4-40GB | — | ~12 min est. | ~$5.90 |
Workload: TinyLlama / Qwen2.5-7B QLoRA fine-tune on tatsu-lab/alpaca, 100 steps, batch 4 × grad-accum 4, MAX_SEQ_LEN 512. Same training script across all providers, run via the public training-base:1.0 image.
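The headline 28× checks out directly against this table: the Lambda Labs cost divided by the Vast.ai cost for the same run is about 27.9.

```python
# Same 100-step QLoRA workload: $0.475 on Lambda Labs vs $0.017 on Vast.ai.
ratio = 0.475 / 0.017
assert round(ratio) == 28  # ~27.9, i.e. the "28x" headline number
```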
Both the TinyLlama 1.1B and Qwen2.5-7B QLoRA fine-tunes pass on all three providers. The test matrix in our repo records job IDs and timings.
AWS and GCP are quota-blocked at the account level (resolution in progress). Five more providers are awaiting end-to-end smoke runs; promoting one onto the allowlist is a one-line config edit once its sweep is green.
Drop your email to get early access when we open up the Summer 2026 cohort. We'll only email you when we launch — no marketing drips.
Or read the code: github.com/hector25/vaultlayer