train
Experiment #47 — Running

From pip install
to production model.
One coffee.

GPU clusters auto-scaled. Hyperparameters tuned. Every experiment versioned. No YAML. No babysitting. Just results.

Experiment #47 complete

accuracy 96.2% · loss 0.041

train run --config minimal.yaml --gpu auto

Training Loss: 2.341
Epoch: 01/60
Accuracy: 71.4%
[Chart: training loss (y-axis, 2.4 to 0.0) vs. epoch (x-axis, 0 to 60)]

GPU Cluster

GPU 0: 94%
GPU 1: 91%
GPU 2: 97%
GPU 3: 88%

Live Log

[INFO] epoch 1 complete
[INFO] lr=2.4e-4 · bs=128
[SAVE] checkpoint saved

Config (4 lines)

model: resnet50
data: ./dataset
epochs: 60
gpu: auto

The Reality

Sound familiar?

Three ways MLOps toolchains are quietly burning your runway.

01

Fragmented toolchain

Weights & Biases for tracking. Ray for distributed training. Airflow for pipelines. Kubernetes for scaling. Four dashboards. Zero sanity.

W&B · Ray · Airflow · K8s · MLflow · DVC · Seldon · +12 more
02

Cloud spend leakage

GPUs idling between runs. Spot instances terminated mid-epoch. $4,200 in AWS charges last month with 38% utilization.

[Chart: daily GPU utilization, Monday to Sunday]
GPU utilization: avg 38%
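
To put those two numbers together, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that the whole $4,200 bill was GPU time billed at a flat rate, so the idle share of spend is simply one minus utilization. The function name `idle_spend` is hypothetical and not part of any tool.

```python
def idle_spend(monthly_bill: float, utilization: float) -> float:
    """Estimate dollars paid for idle GPU time.

    Assumption: the bill is all GPU time at a flat hourly rate, so the
    idle share of spend equals (1 - utilization).
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return monthly_bill * (1.0 - utilization)


# Figures from the card above: $4,200 billed at 38% average utilization.
wasted = idle_spend(4200.0, 0.38)
print(f"Idle spend: ${wasted:,.0f} of $4,200")
```

Under that simplifying assumption, roughly $2,604 of the month's bill paid for GPUs that were doing nothing.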
03

Config file hell

A 200-line YAML that took 3 days to tune, can't be reproduced by your new hire, and breaks when the dataset changes.

model_config:
  architecture: resnet50
  pretrained: true
  num_classes: 1000
  dropout: 0.3
training:
  optimizer: adamw
  lr_scheduler: cosine
  warmup_steps: 500
  gradient_clip: 1.0
  # ... 190 more lines
$ train run .
What if it was one command?

The fix ships tonight.

Each card lists its metrics below. These aren't mockups.

01

Auto-scaling GPU allocation

Declare your model. We provision the cluster. Spot instance fallback keeps costs 60% below on-demand.

$0.42/GPU·hr

avg cost

Performance metrics

Avg provisioning time: < 90s
Spot fallback rate: 99.7%
Cost vs on-demand: −61%
Max cluster size: 512 GPUs

Real data from production clusters · Updated daily
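
As a quick sanity check on the two cost figures, the sketch below backs out the on-demand rate they imply. It assumes the $0.42/GPU·hr average and the −61% saving describe the same workloads, which the page does not state explicitly. The helper name `implied_on_demand_rate` is hypothetical.

```python
def implied_on_demand_rate(discounted_rate: float, saving: float) -> float:
    """Back out the full price from a discounted price and a fractional saving.

    Assumption: discounted_rate = on_demand_rate * (1 - saving).
    """
    if not 0.0 <= saving < 1.0:
        raise ValueError("saving must be in [0, 1)")
    return discounted_rate / (1.0 - saving)


# Figures from the card above: $0.42/GPU·hr at 61% below on-demand.
rate = implied_on_demand_rate(0.42, 0.61)
print(f"Implied on-demand rate: ${rate:.2f}/GPU·hr")
```

On those assumptions the comparison baseline works out to about $1.08 per GPU·hour.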

02

Experiment versioning

Every run is a git commit. Diff any two experiments. Reproduce any result with one command. No more notebook archaeology.

100%

reproducible

Performance metrics

Params tracked: All
Artifact storage: Auto
Diff resolution: Line-level
Rollback time: < 5s

Real data from production clusters · Updated daily
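
To make "diff any two experiments" concrete, here is a minimal sketch of a line-level diff between two run configs. It uses only Python's standard `difflib` and is not Train's own implementation. The first snapshot reuses the 4-line config shown earlier on this page; the second snapshot, with `epochs: 90`, is a made-up variation for illustration.

```python
import difflib

# Config snapshot for one run (the 4-line config from earlier on the page).
run_a = """model: resnet50
data: ./dataset
epochs: 60
gpu: auto
""".splitlines(keepends=True)

# Hypothetical second run: same config with a different epoch count.
run_b = """model: resnet50
data: ./dataset
epochs: 90
gpu: auto
""".splitlines(keepends=True)

# unified_diff yields header lines, a hunk marker, then context and changes.
diff = list(difflib.unified_diff(run_a, run_b, fromfile="run_a", tofile="run_b"))
print("".join(diff), end="")
```

The only changed lines in the output are `-epochs: 60` and `+epochs: 90`, which is the kind of answer you want when two runs disagree.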

03

One-click deployment

Best checkpoint detected automatically. REST endpoint live in 30 seconds. Versioned, monitored, rollback-ready.

30s

to live API

Performance metrics

Avg deploy time: 28s
Uptime SLA: 99.95%
Auto-scaling: Enabled
Rollback: One command

Real data from production clusters · Updated daily
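
The page does not say how the best checkpoint is detected. One common rule is to pick the checkpoint with the lowest validation loss, and the sketch below shows that rule only as an assumption. The `Checkpoint` class, the file paths, and the loss values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Checkpoint:
    epoch: int
    val_loss: float
    path: str


def best_checkpoint(checkpoints: Sequence[Checkpoint]) -> Checkpoint:
    """Return the checkpoint with the lowest validation loss.

    Assumption: lower validation loss means a better model; a real
    system might rank by accuracy or another metric instead.
    """
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return min(checkpoints, key=lambda c: c.val_loss)


# Made-up training history, for illustration only.
history = [
    Checkpoint(epoch=58, val_loss=0.047, path="ckpt/epoch_58.pt"),
    Checkpoint(epoch=59, val_loss=0.041, path="ckpt/epoch_59.pt"),
    Checkpoint(epoch=60, val_loss=0.044, path="ckpt/epoch_60.pt"),
]
best = best_checkpoint(history)
print(f"Deploy candidate: {best.path} (epoch {best.epoch})")
```

Note that the last epoch is not always the best one, which is why the selection step matters before anything goes live.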

Integrations

Slots into your stack.
Doesn't replace it.

Train speaks PyTorch, HuggingFace, and W&B natively. Your existing code runs unchanged — you just stop managing infrastructure.

PyTorch: Native autograd support (compatible)
HuggingFace: Transformers & datasets (compatible)
W&B: Experiment syncing (compatible)
JAX: XLA compilation (compatible)
TensorFlow: Keras & TF2 support (compatible)
Lightning: PyTorch Lightning (compatible)
DVC: Data versioning (compatible)
ONNX: Model export (compatible)
migration.diff
- import wandb
- import ray
- from kubernetes import client, config
- wandb.init(project="my-model", entity="team")
- ray.init(address="ray://cluster:10001")
- # ... 47 more setup lines
 
+ import train
 
train.run(model, dataset, epochs=60)
# That's it. Seriously.

Models trained today: 12,847 (+342)
GPU·hours saved: 94,201 (+1,204)
Experiments tracked: 2.1M (+8.4K)
Avg accuracy gain: +3.7pp vs baseline

PyTorch · HuggingFace · W&B · JAX · Lightning · ONNX · DVC · TensorFlow
Get started

Your model could be
training in 90 seconds.

Install the CLI, point it at your dataset, and watch the loss curve bend. No account required for the first run.

Install
$ pip install train-cli
Python ≥ 3.9 · pip ≥ 22.0 · Free tier available
quickstart — 90 seconds
$ pip install train-cli
$ train init my-model
Scanning dataset... found 142,000 samples
Provisioning 4× A100 cluster...
Training epoch 1/60 ━━━━━━━━━━━━ 100%
Model deployed: https://api.train.run/v1/my-model

Free tier: 5 GPU·hrs/mo
No credit card to start
Cancel anytime, no lock-in
SOC 2 Type II certified