VaultLayer routes ML training across cloud providers and auto-resumes from R2 checkpoints when an instance dies. Zero changes to your training code.
When your instance dies — Vast.ai bid lost, GCP spot preempted, RunPod community pod reclaimed — VaultLayer's broker detects the missed heartbeat, fences the dead instance, and resumes from your last R2 checkpoint on a fresh instance. Same job ID, same training state.
VAULTLAYER'S CORE MOAT

Your model state writes directly to your Cloudflare R2 bucket every N steps. Job-scoped credentials prevent cross-job reads. When training resumes anywhere, on the same provider or a different one, it picks up at the exact step it left off.
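R2 speaks the S3 API, so the checkpoint write path can be pictured with plain boto3. This is an illustrative sketch, not VaultLayer's actual code: the env var names (`R2_ENDPOINT`, `R2_JOB_KEY_ID`, `R2_JOB_SECRET`, `R2_BUCKET`) and the key layout are assumptions.

```python
import os

def checkpoint_key(job_id: str, step: int, filename: str) -> str:
    # Keys are namespaced per job, so a job-scoped credential whose policy
    # only covers `jobs/<job_id>/*` cannot read another job's state.
    return f"jobs/{job_id}/checkpoints/step-{step:08d}/{filename}"

def upload_checkpoint(local_path: str, job_id: str, step: int) -> None:
    # R2 is S3-compatible: point boto3 at the account's R2 endpoint.
    # The broker would inject the job-scoped credential as env vars.
    import boto3
    r2 = boto3.client(
        "s3",
        endpoint_url=os.environ["R2_ENDPOINT"],
        aws_access_key_id=os.environ["R2_JOB_KEY_ID"],
        aws_secret_access_key=os.environ["R2_JOB_SECRET"],
    )
    key = checkpoint_key(job_id, step, os.path.basename(local_path))
    r2.upload_file(local_path, os.environ["R2_BUCKET"], key)
```

Because the key is derived from the job ID, a resumed job on any provider can list its own prefix, find the highest saved step, and pull that checkpoint down.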
CROSS-PROVIDER PORTABLE

The same workload on Vast.ai is 28× cheaper than on Lambda Labs (real numbers from our test matrix). VaultLayer routes to the cheapest validated GPU available, and the reliability layer means cheap-but-flaky becomes cheap-and-reliable.
SAVINGS WITHOUT THE TAX

```shell
# 1. one-time setup
pip install vaultlayer
vaultlayer init

# 2. run any training script
vaultlayer run --gpu A100_40 python train.py
```
Wraps any PyTorch / JAX / HuggingFace script. No SDK to import. No magic decorators. Your script doesn't know it's being managed.
The dispatcher considers an allowlist of validated providers (today: Vast.ai, RunPod, Lambda Labs) and picks by price + capacity. Your job lands within seconds.
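A toy version of that price-and-capacity routing, using the provider names from the allowlist above. The `Offer` shape and the cheapest-first rule are assumptions for illustration, not the dispatcher's real internals.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Offer:
    provider: str        # marketplace the offer came from
    gpu: str             # GPU type, e.g. "A100_40"
    usd_per_hour: float  # current spot/bid price
    available: int       # instances with free capacity

# Today's validated allowlist, per the text above.
ALLOWLIST = {"vast.ai", "runpod", "lambda-labs"}

def pick_offer(offers: list, gpu: str) -> Optional[Offer]:
    # Filter to validated providers with the requested GPU and real capacity,
    # then take the cheapest remaining offer.
    candidates = [
        o for o in offers
        if o.provider in ALLOWLIST and o.gpu == gpu and o.available > 0
    ]
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)
```

With live price feeds plugged in, the same filter-then-min shape is enough to land a job on the cheapest validated GPU.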
Standard HuggingFace save_steps writes flow to your Cloudflare R2 bucket on a job-scoped credential. Master keys never leave the broker.
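Concretely, the only checkpoint knobs live in your own script. A standard HuggingFace configuration like the following is enough; nothing here is VaultLayer-specific, and the values are examples:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",  # local save dir; writes here flow to R2 per the doc
    save_steps=50,             # "every N steps" -- N is your choice
    save_total_limit=2,        # bound local disk usage on the instance
)
```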
Heartbeat watchdog detects the missing instance. Broker re-provisions on the same or next-best provider. New instance reads the R2 checkpoint and continues from the last saved step.
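The watchdog side reduces to a timestamp comparison. A minimal sketch, assuming a 90-second timeout and an in-memory heartbeat map; both are illustrative choices, not VaultLayer internals:

```python
HEARTBEAT_TIMEOUT_S = 90.0  # illustrative threshold

def find_dead_instances(last_beat: dict, now: float) -> list:
    """Return instance IDs whose last heartbeat is older than the timeout.

    `last_beat` maps instance ID -> unix timestamp of its latest heartbeat.
    The broker would fence each returned instance (revoke its credential)
    before re-provisioning, so a zombie can't keep writing checkpoints.
    """
    return [iid for iid, t in last_beat.items() if now - t > HEARTBEAT_TIMEOUT_S]
```

Fencing before resume is the important ordering: revoke the dead instance's job-scoped credential first, then let the replacement read the last checkpoint.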
| Provider | GPU | Pure training | Wall-clock | Cost |
|---|---|---|---|---|
| Vast.ai (marketplace) | A100-PCIE-40GB | 7.2 min | 10.9 min | $0.017 |
| RunPod (community pods) | RTX 4090 (24 GB) | 8.1 min | 11.1 min | $0.128 |
| Lambda Labs (on-demand) | A100-SXM4-40GB | 6.5 min | 14.3 min | $0.475 |
| vs AWS p4d.24xlarge (on-demand baseline) | A100-SXM4-40GB | — | ~12 min est. | ~$5.90 |
Workload: TinyLlama / Qwen2.5-7B QLoRA fine-tune on tatsu-lab/alpaca, 100 steps, batch 4 × grad-accum 4, MAX_SEQ_LEN 512. Same training script across all providers, run via the public training-base:1.0 image.
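The headline 28× checks out directly against this table: the Lambda Labs cost divided by the Vast.ai cost for the same run is about 27.9.

```python
# Same 100-step QLoRA workload: $0.475 on Lambda Labs vs $0.017 on Vast.ai.
ratio = 0.475 / 0.017
assert round(ratio) == 28  # ~27.9, i.e. the "28x" headline number
```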
Both the TinyLlama 1.1B and Qwen2.5-7B QLoRA fine-tunes pass on all three providers. The test matrix in our repo records job IDs and timings.
AWS and GCP are quota-blocked at the account level (resolution in progress). Five more providers are awaiting end-to-end smoke runs; promoting one onto the allowlist is a one-line config edit once its sweep is green.
Drop your email to get early access when we open up the Summer 2026 cohort. We'll only email you when we launch — no marketing drips.
Or read the code: github.com/hector25/vaultlayer