# SmolVLA fine-tuning
SmolVLA is Hugging Face LeRobot's lightweight vision-language-action (VLA) foundation model. This guide targets LeRobot v0.5.0. The official LeRobot v0.5.0 SmolVLA guide uses `lerobot/smolvla_base` as the pretrained checkpoint for fine-tuning on LeRobot datasets; the `lerobot/smolvla_base` model card lists multi-view images, robot state, optional language instructions, and continuous actions as inputs.
If you want a reliable first baseline, SmolVLA is usually easier to bootstrap than Pi0/Pi0.5: it is still a VLA, but the environment and commands stay inside the official LeRobot stack.
IO-AI publishes `ioaitech/lerobot-gpu:v0.5.0`, which bundles LeRobot v0.5.0, GPU training dependencies, a video decoding stack, and the `lerobot-train` entry point, ready for SmolVLA fine-tuning. Older tags (v0.4.4, v0.3.3) exist only for historical reproduction; new work should default to v0.5.0.
Published image tags on Docker Hub:
- `ioaitech/lerobot-gpu:v0.5.0`
- `ioaitech/lerobot-gpu:v0.4.4`
- `ioaitech/lerobot-gpu:v0.3.3`
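To fetch the default tag:

```bash
docker pull ioaitech/lerobot-gpu:v0.5.0
```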
## Version selection
Match the dataset format (`codebase_version` in `meta/info.json`) to the training image.
| Dataset format | Recommended image | Notes |
|---|---|---|
| v3.0 | `ioaitech/lerobot-gpu:v0.5.0` | Default path; matches official SmolVLA commands. |
| v3.0 | `ioaitech/lerobot-gpu:v0.4.4` | Reproduction of the older v0.4 training stack. |
| v2.1 | `ioaitech/lerobot-gpu:v0.3.3` | Legacy datasets / historical runs. |
Prefer exporting or converting new work to v3.0 to reduce the compatibility surface; you can read the format field directly, as sketched below.
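A minimal check of which format a dataset uses before picking an image (the dataset root below is a placeholder):

```python
import json
from pathlib import Path

# Point this at your LeRobot dataset root.
root = Path("/path/to/lerobot_dataset")
info = json.loads((root / "meta" / "info.json").read_text())
print(info["codebase_version"])  # e.g. "v3.0" -> ioaitech/lerobot-gpu:v0.5.0
```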
## Data requirements
The official doc suggests roughly 50 high-quality episodes, with repeated coverage per variation (e.g., five cube positions × ten episodes each). A 25-episode variant underperformed in their example. Do not copy counts blindly; treat coverage as more important than raw step count.
At minimum, verify the following (a spot-check sketch follows this list):
- Image keys are consistent across episodes.
- `observation.state` and `action` shapes are fixed.
- Task strings align with the physical goal.
- The train split spans object poses, lighting, and initial configurations.
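A quick spot-check along these lines (a sketch; assumes the v0.5.0 `LeRobotDataset` import path and the local paths used in the commands below):

```python
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# repo_id and root mirror the training flags used later in this guide.
ds = LeRobotDataset("local/my_dataset", root="/path/to/lerobot_dataset")
sample = ds[0]
print(sorted(k for k in sample if k.startswith("observation.images")))
print(sample["observation.state"].shape, sample["action"].shape)
print(sample.get("task"))  # task string, if your dataset records one
```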
## Training with the image
GPU check:

```bash
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Smoke test
```bash
docker run --rm --gpus all --shm-size 16g \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/outputs \
  ioaitech/lerobot-gpu:v0.5.0 \
  bash -lc 'lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=local/my_dataset \
    --dataset.root=/data/input \
    --batch_size=8 \
    --steps=1000 \
    --output_dir=/outputs/smolvla_smoke \
    --job_name=smolvla_smoke \
    --policy.device=cuda \
    --wandb.enable=false'
```
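If the smoke test completes, confirm a checkpoint landed on the host side (the directory layout below is LeRobot's default; verify against your run):

```bash
ls /path/to/output/smolvla_smoke/checkpoints/
# expect step-numbered checkpoint directories plus a `last` pointer
```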
### Fine-tuning template
```bash
docker run --rm --gpus all --shm-size 16g \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/outputs \
  ioaitech/lerobot-gpu:v0.5.0 \
  bash -lc 'lerobot-train \
    --policy.path=lerobot/smolvla_base \
    --dataset.repo_id=local/my_dataset \
    --dataset.root=/data/input \
    --batch_size=64 \
    --steps=20000 \
    --output_dir=/outputs/smolvla_finetune \
    --job_name=smolvla_finetune \
    --policy.device=cuda \
    --wandb.enable=false'
```
The official doc cites ~4 hours on a single A100 for ~20k steps; your wall time depends on resolution, decoding, batch size, and GPU model.
Hub-hosted dataset (drop `--dataset.root` and set a real repo id):

```bash
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=your-name/your-dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla \
  --job_name=smolvla_finetune \
  --policy.device=cuda \
  --wandb.enable=true
```
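Hub-hosted datasets and W&B logging both need credentials; a typical one-time setup on the machine running training:

```bash
huggingface-cli login   # Hub access for pulling the dataset
wandb login             # only needed when --wandb.enable=true
```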
## After installing LeRobot locally
Read the Install LeRobot guide first. v0.5.0 requires Python >= 3.12; training uses the `lerobot-train` entry point.
```bash
git clone https://github.com/huggingface/lerobot.git
cd lerobot
git checkout v0.5.0
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
pip install -e ".[smolvla]"
```
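Before launching a long run, a quick check that the install resolved (assumes the venv above is active):

```bash
pip show lerobot | head -n 2    # expect Version: 0.5.0
lerobot-train --help | head -n 5
```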
Local dataset path:
```bash
lerobot-train \
  --policy.path=lerobot/smolvla_base \
  --dataset.repo_id=local/my_dataset \
  --dataset.root=/path/to/lerobot_dataset \
  --batch_size=64 \
  --steps=20000 \
  --output_dir=outputs/train/smolvla_finetune \
  --job_name=smolvla_finetune \
  --policy.device=cuda \
  --wandb.enable=false
```
To mirror the official example exactly, upload the dataset to the Hub and pass only `--dataset.repo_id=${HF_USER}/mydataset`.
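One way to do that upload (generic Hub tooling, not SmolVLA-specific; `mydataset` is a placeholder name):

```bash
HF_USER=$(huggingface-cli whoami | head -n 1)
huggingface-cli upload ${HF_USER}/mydataset /path/to/lerobot_dataset --repo-type dataset
```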
## Tuning guidance
| Flag | Guidance |
|---|---|
| `--batch_size` | Start small for VRAM headroom, then increase if training is stable. |
| `--steps` | Run a 1k smoke test first; ~20k is a reasonable first serious baseline. |
| `--policy.device` | Use `cuda` on NVIDIA GPUs; CPU is for plumbing checks only. |
| `--wandb.enable` | Enable for long or shared runs; disable for quick local tests. |
| `--output_dir` | Use a unique directory per experiment to avoid overwriting checkpoints. |
Avoid editing the architecture early. Stick to the data and the default command until the loss curve, checkpoint loading, and a small inference sanity check all behave; only then consider PEFT, learning-rate, or action-horizon tweaks.
## Quick inference sanity check
After training, load a checkpoint once to catch key/shape mismatches:
```python
import torch

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Checkpoint path follows LeRobot's default output layout; adjust if your
# version nests the weights (e.g. under last/pretrained_model).
policy = SmolVLAPolicy.from_pretrained(
    "/path/to/output/smolvla_finetune/checkpoints/last",
)
policy.to("cuda")
policy.eval()

# Dummy batch. Camera key, image size, state dim, and task string are
# placeholders; match them to your dataset's meta/info.json.
observation = {
    "observation.images.front": torch.zeros(1, 3, 256, 256, device="cuda"),
    "observation.state": torch.zeros(1, 6, device="cuda"),
    "task": ["pick up the cube"],  # SmolVLA conditions on a language instruction
}

with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)
```
If this fails, fix the dataset camera keys and the `observation.state` shape before chasing hyperparameters.
## Troubleshooting
### `--policy.path` differs from older docs

LeRobot v0.5.0 uses `--policy.path=lerobot/smolvla_base`. Older snippets may show `--policy.type smolvla` or `--policy.pretrained_path`; pin the git tag when reproducing legacy configs.
### Video decode errors during training
LeRobot expects `ffmpeg` and compatible PyTorch/TorchCodec builds. The GPU image usually includes them; for source installs, verify the versions first (see the check below).
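A minimal version check for the decode stack (plain CLI queries, nothing LeRobot-specific):

```bash
ffmpeg -version | head -n 1
python -c "import torch; print(torch.__version__)"
pip show torchcodec | head -n 2
```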
### Loss improves but the real robot fails
Audit data coverage and task-text consistency. SmolVLA fine-tuning will not invent missing object poses, camera angles, or recovery behaviors.