# SmolVLA fine-tuning
SmolVLA is Hugging Face LeRobot's lightweight vision-language-action (VLA) foundation model. This guide targets LeRobot v0.5.0. The official LeRobot v0.5.0 SmolVLA guide uses `lerobot/smolvla_base` as the pretrained checkpoint for fine-tuning on LeRobot datasets; the `lerobot/smolvla_base` model card lists multi-view images, robot state, optional language instructions, and continuous actions as inputs.
If you want a reliable first baseline, SmolVLA is usually easier to bootstrap than Pi0/Pi0.5: it is still a VLA, but the environment and commands stay inside the official LeRobot stack.
IO-AI publishes `ioaitech/lerobot-gpu:v0.5.0`, which bundles LeRobot v0.5.0, GPU training dependencies, a video decoding stack, and the `lerobot-train` entry point, ready for SmolVLA fine-tuning. Older tags (v0.4.4, v0.3.3) exist only for historical reproduction; new work should default to v0.5.0.
Published image tags on Docker Hub:
- `ioaitech/lerobot-gpu:v0.5.0`
- `ioaitech/lerobot-gpu:v0.4.4`
- `ioaitech/lerobot-gpu:v0.3.3`
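To fetch the default tag:

```bash
docker pull ioaitech/lerobot-gpu:v0.5.0
```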
## Version selection
Match the dataset format (`codebase_version` in `meta/info.json`) to the training image.
| Dataset format | Recommended image | Notes |
|---|---|---|
| v3.0 | `ioaitech/lerobot-gpu:v0.5.0` | Default path; matches official SmolVLA commands. |
| v3.0 | `ioaitech/lerobot-gpu:v0.4.4` | Reproduction of the older v0.4 training stack. |
| v2.1 | `ioaitech/lerobot-gpu:v0.3.3` | Legacy datasets / historical runs. |
Prefer exporting or converting new work to v3.0 to reduce the compatibility surface; you can read the format field directly, as sketched below.
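A minimal check of which format a dataset uses before picking an image (the dataset root below is a placeholder):

```python
import json
from pathlib import Path

# Point this at your LeRobot dataset root.
root = Path("/path/to/lerobot_dataset")
info = json.loads((root / "meta" / "info.json").read_text())
print(info["codebase_version"])  # e.g. "v3.0" -> ioaitech/lerobot-gpu:v0.5.0
```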
## Data requirements
The official doc suggests roughly 50 high-quality episodes, with repeated coverage per variation (e.g., five cube positions × ten episodes each). A 25-episode variant underperformed in their example. Do not copy counts blindly; treat coverage as more important than raw step count.
At minimum, verify the following (a spot-check sketch follows this list):
- Image keys are consistent across episodes.
- `observation.state` and `action` shapes are fixed.
- Task strings align with the physical goal.
- The train split spans object poses, lighting, and initial configurations.
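A quick spot-check along these lines (a sketch; assumes the v0.5.0 `LeRobotDataset` import path and the local paths used in the commands below):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# repo_id and root mirror the training flags used later in this guide.
ds = LeRobotDataset("local/my_dataset", root="/path/to/lerobot_dataset")
sample = ds[0]
print(sorted(k for k in sample if k.startswith("observation.images")))
print(sample["observation.state"].shape, sample["action"].shape)
print(sample.get("task"))  # task string, if your dataset records one
```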
## Training with the image
GPU check:

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Smoke test
```bash
docker run --rm --gpus all --shm-size 16g \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/outputs \
  ioaitech/lerobot-gpu:v0.5.0 \
  bash -lc 'lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=local/my_dataset \
    --dataset.root=/data/input \
    --batch_size=8 \
    --steps=1000 \
    --output_dir=/outputs/smolvla_smoke \
    --job_name=smolvla_smoke \
    --policy.device=cuda \
    --wandb.enable=false'
```
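If the smoke test completes, confirm a checkpoint landed on the host side (the directory layout below is LeRobot's default; verify against your run):

```bash
ls /path/to/output/smolvla_smoke/checkpoints/
# expect step-numbered checkpoint directories plus a `last` pointer
```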
### Fine-tuning template
```bash
docker run --rm --gpus all --shm-size 16g \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/outputs \
  ioaitech/lerobot-gpu:v0.5.0 \
  bash -lc 'lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=local/my_dataset \
    --dataset.root=/data/input \
    --batch_size=64 \
    --steps=20000 \
    --output_dir=/outputs/smolvla_finetune \
    --job_name=smolvla_finetune \
    --policy.device=cuda \
    --wandb.enable=false'
```
The official doc cites ~4 hours on a single A100 for ~20k steps; your wall time depends on resolution, decoding, batch size, and GPU model.
Hub-hosted dataset (drop `--dataset.root` and set a real repo id):

```bash
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your-name/your-dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla \
  --job_name=smolvla_finetune \
  --policy.device=cuda \
  --wandb.enable=true
```
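Hub-hosted datasets and W&B logging both need credentials; a typical one-time setup on the machine running training:

```bash
huggingface-cli login   # Hub access for pulling the dataset
wandb login             # only needed when --wandb.enable=true
```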
## After installing LeRobot locally
Read the Install LeRobot guide first. v0.5.0 requires Python >= 3.12; training uses the `lerobot-train` entry point.
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
git checkout v0.5.0
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[smolvla]"
```
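Before launching a long run, a quick check that the install resolved (assumes the venv above is active):

```bash
pip show lerobot | head -n 2    # expect Version: 0.5.0
lerobot-train --help | head -n 5
```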
Local dataset path:
```bash
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=local/my_dataset \
  --dataset.root=/path/to/lerobot_dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_finetune \
  --job_name=smolvla_finetune \
  --policy.device=cuda \
  --wandb.enable=false
```
To mirror the official example exactly, upload the dataset to the Hub and pass only `--dataset.repo_id=${HF_USER}/mydataset`.
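One way to do that upload (generic Hub tooling, not SmolVLA-specific; `mydataset` is a placeholder name):

```bash
HF_USER=$(huggingface-cli whoami | head -n 1)
huggingface-cli upload ${HF_USER}/mydataset /path/to/lerobot_dataset --repo-type dataset
```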
## Tuning guidance
| Flag | Guidance |
|---|---|
| `--batch_size` | Start small for VRAM headroom, then increase if training is stable. |
| `--steps` | Run a 1k smoke test first; ~20k is a reasonable first serious baseline. |
| `--policy.device` | Use `cuda` on NVIDIA GPUs; CPU is for plumbing checks only. |
| `--wandb.enable` | Enable for long or shared runs; disable for quick local tests. |
| `--output_dir` | Use a unique directory per experiment to avoid overwriting checkpoints. |
Avoid editing the architecture early. Stick to the data and the default command until the loss curve, checkpoint loading, and a small inference sanity check all behave; only then consider PEFT, learning-rate, or action-horizon tweaks.
## Quick inference sanity check
After training, load a checkpoint once to catch key/shape mismatches:
```python
import torch

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Checkpoint path follows LeRobot's default output layout; adjust if your
# version nests the weights (e.g. under last/pretrained_model).
policy = SmolVLAPolicy.from_pretrained(
    "/path/to/output/smolvla_finetune/checkpoints/last",
)
policy.to("cuda")
policy.eval()

# Dummy batch. Camera key, image size, state dim, and task string are
# placeholders; match them to your dataset's meta/info.json.
observation = {
    "observation.images.front": torch.zeros(1, 3, 256, 256, device="cuda"),
    "observation.state": torch.zeros(1, 6, device="cuda"),
    "task": ["pick up the cube"],  # SmolVLA conditions on a language instruction
}

with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)
```
If this fails, fix the dataset camera keys and the `observation.state` shape before chasing hyperparameters.
## Troubleshooting
### `--policy.path` differs from older docs

LeRobot v0.5.0 uses `--policy.path=lerobot/smolvla_base`. Older snippets may show `--policy.type smolvla` or `--policy.pretrained_path`; pin the git tag when reproducing legacy configs.
### Video decode errors during training
LeRobot expects `ffmpeg` and compatible PyTorch/TorchCodec builds. The GPU image usually includes them; for source installs, verify the versions first (see the check below).
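A minimal version check for the decode stack (plain CLI queries, nothing LeRobot-specific):

```bash
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__)"
pip show torchcodec | head -n 2
```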
### Loss improves but the real robot fails
Audit data coverage and task-text consistency. SmolVLA fine-tuning will not invent missing object poses, camera angles, or recovery behaviors.