Experiment #47 — Running

From `pip install`
to production model.
One coffee.

GPU clusters auto-scaled. Hyperparameters tuned. Every experiment versioned. No YAML. No babysitting. Just results.

Experiment #47 complete

accuracy 96.2% · loss 0.041

train run --config minimal.yaml --gpu auto

Training Loss 2.341 · Epoch 01/60 · Accuracy 71.4%

[Chart: training loss (0.0 to 2.4) against epoch (0 to 60)]

GPU Cluster

GPU 0: 94% · GPU 1: 91% · GPU 2: 97% · GPU 3: 88%

Live Log

[INFO] epoch 1 complete
[INFO] lr=2.4e-4 · bs=128
[SAVE] checkpoint saved

Config (4 lines)

model: resnet50
data: ./dataset
epochs: 60
gpu: auto
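
To show how little there is to parse, here is a minimal sketch that reads those four keys with nothing but the standard library. `RunConfig` and `parse_config` are hypothetical names, the field types are assumptions, and none of this is the Train CLI's actual loader.

```python
from dataclasses import dataclass


@dataclass
class RunConfig:
    """Hypothetical container for the four keys shown in the config."""
    model: str
    data: str
    epochs: int
    gpu: str


def parse_config(text: str) -> RunConfig:
    """Parse flat `key: value` lines; blank lines and # comments are skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    # Assumption: epochs is an integer and the other three values are strings.
    return RunConfig(
        model=fields["model"],
        data=fields["data"],
        epochs=int(fields["epochs"]),
        gpu=fields["gpu"],
    )


CONFIG_TEXT = """\
model: resnet50
data: ./dataset
epochs: 60
gpu: auto
"""

config = parse_config(CONFIG_TEXT)
print(config)
```

A flat parser like this only works because the config has no nesting; anything deeper would need a real YAML library.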

The Reality

Sound familiar?

Three ways MLOps toolchains are quietly burning your runway.

Fragmented toolchain

Weights & Biases for tracking. Ray for distributed training. Airflow for pipelines. Kubernetes for scaling. Four dashboards. Zero sanity.

W&B · Ray · Airflow · K8s · MLflow · DVC · Seldon · +12 more

Cloud spend leakage

GPUs idling between runs. Spot instances terminated mid-epoch. $4,200 in AWS charges last month with 38% utilization.

GPU utilization: avg 38%
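
A quick back-of-the-envelope check on those two numbers, as a sketch. It assumes spend scales linearly with GPU time, so an idle hour costs the same as a busy one; real bills mix in storage and networking, so treat the result as a rough estimate.

```python
monthly_spend_usd = 4200.00  # AWS charges quoted above
utilization = 0.38           # average GPU utilization quoted above

# Assumption: every billed GPU hour costs the same whether it is busy or idle.
idle_spend_usd = round(monthly_spend_usd * (1 - utilization), 2)
useful_spend_usd = round(monthly_spend_usd * utilization, 2)

print(f"Paid for idle GPUs:   ${idle_spend_usd:,.2f}")
print(f"Paid for useful work: ${useful_spend_usd:,.2f}")
```

Under that assumption, roughly $2,604 of the $4,200 went to GPUs that were doing nothing.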

Config file hell

A 200-line YAML that took 3 days to tune, can't be reproduced by your new hire, and breaks when the dataset changes.

model_config:
  architecture: resnet50
  pretrained: true
  num_classes: 1000
  dropout: 0.3
training:
  optimizer: adamw
  lr_scheduler: cosine
  warmup_steps: 500
  gradient_clip: 1.0
  # ... 190 more lines

$ train run .

What if it was one command?

The fix ships tonight.

Each card lists its metrics. These aren't mockups.

Auto-scaling GPU allocation

Declare your model. We provision the cluster. Spot instance fallback keeps costs 60% below on-demand.

$0.42/GPU·hr avg cost

Performance metrics

Avg provisioning time: < 90s
Spot fallback rate: 99.7%
Cost vs on-demand: −61%
Max cluster size: 512 GPUs

Real data from production clusters · Updated daily
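
For a sense of what those two figures imply together, here is a small sketch. It assumes the 61% reduction applies to the same workload mix as the $0.42 average, which the page does not state; `monthly_saving_usd` is a hypothetical helper.

```python
avg_cost_per_gpu_hr = 0.42    # average cost quoted above, USD
discount_vs_on_demand = 0.61  # "Cost vs on-demand" figure quoted above

# Assumption: both figures describe the same workloads, so
# avg_cost = on_demand * (1 - discount).
implied_on_demand = avg_cost_per_gpu_hr / (1 - discount_vs_on_demand)


def monthly_saving_usd(gpu_hours: float) -> float:
    """Estimated saving versus the implied on-demand rate for a given usage."""
    return round((implied_on_demand - avg_cost_per_gpu_hr) * gpu_hours, 2)


print(f"Implied on-demand rate: ${implied_on_demand:.2f}/GPU·hr")
print(f"Saving at 1,000 GPU·hrs: ${monthly_saving_usd(1000):,.2f}")
```

That works out to an implied on-demand rate of about $1.08 per GPU·hr.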

Experiment versioning

Every run is a git commit. Diff any two experiments. Reproduce any result with one command. No more notebook archaeology.

100% reproducible

Performance metrics

Params tracked: All
Artifact storage: Auto
Diff resolution: Line-level
Rollback time: < 5s
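
To make "diff any two experiments" concrete, here is a minimal sketch of a line-level diff between two parameter sets, using Python's standard `difflib`. It is an illustration of the idea, not Train's implementation: `diff_experiments` is a hypothetical helper, and the `run_46` values are invented for the example (only `lr=2.4e-4` and a batch size of 128 appear on this page).

```python
import difflib
import json


def diff_experiments(params_a: dict, params_b: dict,
                     name_a: str = "run_a", name_b: str = "run_b") -> list:
    """Return a unified, line-level diff between two parameter sets."""
    # Serialize with sorted keys so the same params always produce the same lines.
    lines_a = json.dumps(params_a, indent=2, sort_keys=True).splitlines()
    lines_b = json.dumps(params_b, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(
        lines_a, lines_b, fromfile=name_a, tofile=name_b, lineterm=""
    ))


# Hypothetical runs: only the learning rate differs.
run_46 = {"model": "resnet50", "epochs": 60, "lr": 3e-4, "batch_size": 128}
run_47 = {"model": "resnet50", "epochs": 60, "lr": 2.4e-4, "batch_size": 128}

diff = diff_experiments(run_46, run_47, "experiment_46", "experiment_47")
print("\n".join(diff))
```

Sorting the keys before diffing is the design choice that matters here: without it, two identical runs could show spurious changes just from dictionary ordering.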


One-click deployment

Best checkpoint detected automatically. REST endpoint live in 30 seconds. Versioned, monitored, rollback-ready.

30s to live API

Performance metrics

Avg deploy time: 28s
Uptime SLA: 99.95%
Auto-scaling: Enabled
Rollback: One command
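
The page does not say how the best checkpoint is chosen. One plausible rule, sketched below, is "lowest validation loss, with ties going to the later epoch". The `Checkpoint` class, the `best_checkpoint` helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """Hypothetical record of one saved checkpoint."""
    epoch: int
    val_loss: float
    path: str


def best_checkpoint(checkpoints: list) -> Checkpoint:
    """Pick the checkpoint with the lowest validation loss.

    Assumption: lower val_loss is better, and ties go to the later epoch.
    """
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return min(checkpoints, key=lambda c: (c.val_loss, -c.epoch))


# Illustrative values only, not real training output.
history = [
    Checkpoint(epoch=58, val_loss=0.044, path="ckpt/epoch_58.pt"),
    Checkpoint(epoch=59, val_loss=0.041, path="ckpt/epoch_59.pt"),
    Checkpoint(epoch=60, val_loss=0.041, path="ckpt/epoch_60.pt"),
]

best = best_checkpoint(history)
print(f"Deploying {best.path} (val_loss={best.val_loss})")
```

A production selector would likely weigh more than one metric, so treat this as the simplest version of the idea.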


Integrations

Slots into your stack.
Doesn't replace it.

Train speaks PyTorch, HuggingFace, and W&B natively. Your existing code runs unchanged — you just stop managing infrastructure.

PyTorch: Native autograd support (compatible)
HuggingFace: Transformers & datasets (compatible)
W&B: Experiment syncing (compatible)
JAX: XLA compilation (compatible)
TensorFlow: Keras & TF2 support (compatible)
Lightning: PyTorch Lightning (compatible)
DVC: Data versioning (compatible)
ONNX: Model export (compatible)

migration.diff

- import wandb
- import ray
- from kubernetes import client, config
- wandb.init(project="my-model", entity="team")
- ray.init(address="ray://cluster:10001")
- # ... 47 more setup lines
+ import train
+ train.run(model, dataset, epochs=60)
+ # That's it. Seriously.

Models trained today: 12,847 (+342)
GPU·hours saved: 94,201 (+1,204)
Experiments tracked: 2.1M (+8.4K)
Avg accuracy gain: +3.7pp vs baseline


Get started

Your model could be
training in 90 seconds.

Install the CLI, point it at your dataset, and watch the loss curve bend. No account required for the first run.

Install

$ pip install train-cli

Or try in browser — no install needed →

Python ≥ 3.9 · pip ≥ 22.0 · Free tier available

quickstart — 90 seconds

$ pip install train-cli
$ train init my-model
→ Scanning dataset... found 142,000 samples
→ Provisioning 4× A100 cluster...
→ Training epoch 1/60 ━━━━━━━━━━━━ 100%
✓ Model deployed: https://api.train.run/v1/my-model

Free tier: 5 GPU·hrs/mo
No credit card to start
Cancel anytime, no lock-in
SOC 2 Type II certified
