Fine-tuning SmolVLA Model
SmolVLA (Small Vision-Language-Action) is a lightweight VLA model (approx. 450M parameters) released by Hugging Face. If you want to quickly walk through the "data → training → inference" pipeline on a single GPU or a consumer-grade card, SmolVLA is usually the easiest starting point.
This article will guide you through three things:
- Preparing LeRobot format data (local or Hub)
- Starting fine-tuning training (single-card/multi-card/memory-optimized)
- Performing a minimal inference validation with the trained model
Prerequisites
System Requirements
- OS: Linux (Ubuntu 20.04+ recommended) or macOS.
- Python Version: 3.10+ recommended (matches common LeRobot dependencies better).
- GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 8GB VRAM.
- Memory: At least 16GB RAM.
- Storage: At least 50GB available space.
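If you are not sure whether your machine meets the GPU requirement, a quick check with PyTorch (assuming it is already installed) looks like this:
import torch

# Quick hardware check against the requirements above.
if not torch.cuda.is_available():
    print("No CUDA GPU detected; training will be very slow or impossible.")
else:
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
    if vram_gb < 8:
        print("Warning: less than 8GB VRAM; consider the memory-optimized settings below.")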
Environment Preparation
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment (venv recommended; conda also works)
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
# Install dependencies (SmolVLA-related dependencies are usually in extras)
pip install -e ".[smolvla]"
2. Install Additional Dependencies
# Install Flash Attention (Optional: typically only for Linux + NVIDIA CUDA environments; not applicable for macOS/CPU)
pip install flash-attn --no-build-isolation
# Install Weights & Biases (for experiment tracking; optional)
pip install wandb
If you don't want to use W&B for now, you can skip wandb login. Training is unaffected; just set --wandb.enable false (or drop the W&B flags) in the commands below.
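After installation, a minimal import check helps confirm everything is in place (flash_attn and wandb are optional, so a failed import there is not necessarily a problem):
import importlib

# Verify that the core and optional packages installed above can be imported.
for module in ("lerobot", "torch", "flash_attn", "wandb"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError as exc:
        print(f"{module}: not available ({exc})")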
Data Preparation
LeRobot Format Data
SmolVLA uses the LeRobot dataset format. You need at least the following files/directories (names might vary slightly depending on export method, but the core is meta.json + data/):
your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...
Data Quality Requirements
According to Hugging Face recommendations, SmolVLA needs:
- At least 25 high-quality episodes to achieve good performance.
- 100+ episodes recommended for optimal results.
- Each episode should contain a complete task execution process.
- Image resolution recommended at 224x224 or 256x256.
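Before launching training, it is worth loading the dataset once and checking which keys and shapes it exposes. A minimal sketch, assuming a recent LeRobot version where LeRobotDataset lives under lerobot.datasets (older releases use lerobot.common.datasets) and the local dataset used in the commands below:
from pathlib import Path

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# repo_id / root mirror DATASET_ID / DATASET_ROOT from the training commands below.
dataset = LeRobotDataset(
    "local/mylerobot3",
    root=Path("~/Downloads/mylerobot3").expanduser(),
)
print("episodes:", dataset.num_episodes)
print("frames:", dataset.num_frames)

# Inspect one sample: these keys should match what the policy is fed at inference time.
sample = dataset[0]
for key, value in sample.items():
    print(key, getattr(value, "shape", type(value).__name__))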
Fine-tuning Training
Basic Training Command
# Clearly define dataset "ID" and "path" to avoid command errors
# - Local dataset: DATASET_ID as local/xxx, and provide DATASET_ROOT
# - Hub dataset: DATASET_ID as your-name/your-repo, and remove --dataset.root
DATASET_ID=local/mylerobot3
DATASET_ROOT=~/Downloads/mylerobot3
export CUDA_VISIBLE_DEVICES=0
# Start SmolVLA fine-tuning
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${DATASET_ID} \
--dataset.root ${DATASET_ROOT} \
--batch_size 64 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_finetuning \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.scheduler_warmup_steps 1000 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 5000 \
--wandb.enable true \
--wandb.project smolvla_finetuning
If using a Hub dataset (e.g., your-name/your-repo), change DATASET_ID to that value and remove the --dataset.root ... line.
Advanced Training Configurations
Multi-GPU Training
# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
$(which lerobot-train) \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${DATASET_ID} \
--dataset.root ${DATASET_ROOT} \
--batch_size 32 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_multi_gpu \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
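Note that with torchrun the --batch_size above is typically the per-process batch, so the effective (global) batch grows with the number of GPUs; verify this against your LeRobot version. That is why the example uses 32 per GPU instead of 64:
# Effective batch size under data parallelism (assuming --batch_size is per process).
n_gpus = 2
per_gpu_batch_size = 32
print("effective batch size:", n_gpus * per_gpu_batch_size)  # 64, matching the single-GPU setup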
Memory-Optimized Configuration
# For GPUs with smaller VRAM
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${DATASET_ID} \
--dataset.root ${DATASET_ROOT} \
--batch_size 16 \
--steps 30000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_memory_optimized \
--policy.device cuda \
--policy.optimizer_lr 5e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Parameter Details
Core Parameters
| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.type | Policy type | smolvla | SmolVLA model type |
| --policy.pretrained_path | Pre-trained model path | lerobot/smolvla_base | Official pre-trained model on Hugging Face |
| --dataset.repo_id | Dataset ID | local/mylerobot3 | Use local/xxx for local training, or your-name/your-repo for a Hub dataset |
| --dataset.root | Local dataset path | ~/Downloads/mylerobot3 | Only needed for local datasets (the directory containing meta.json, data/, etc.) |
| --batch_size | Batch size | 64 | Adjust to your VRAM; 32-64 recommended for an RTX 3080 |
| --steps | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| --output_dir | Output directory | outputs/train/smolvla_finetuned | Model save path |
| --job_name | Job name | smolvla_finetuning | Used for logs and experiment tracking (optional) |
Training Parameters
| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-4 | Can be lowered somewhat for fine-tuning |
| --policy.scheduler_warmup_steps | Warmup steps | 1000 | LR warmup for more stable training |
| --policy.use_amp | Use AMP | true | Saves VRAM and speeds up training |
| --policy.optimizer_grad_clip_norm | Gradient clipping | 1.0 | Prevents gradient explosion |
| --num_workers | Data loader workers | 4 | Adjust based on CPU cores |
| --policy.push_to_hub | Push to Hub | false | Whether to upload the model to Hugging Face (requires a repo_id) |
| --save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
| --save_freq | Save frequency | 5000 | Step interval between checkpoints |
Model Specific Parameters
| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.vlm_model_name | VLM backbone | HuggingFaceTB/SmolVLM2-500M-Video-Instruct | VLM backbone used by SmolVLA |
| --policy.chunk_size | Action chunk size | 50 | Length of the predicted action sequence |
| --policy.n_action_steps | Execution steps | 50 | Number of actions executed per prediction |
| --policy.n_obs_steps | Observation history | 1 | Number of past observation frames used |
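How chunk_size and n_action_steps interact: each model call predicts a chunk of chunk_size future actions, and the controller executes the first n_action_steps of them before querying the model again (in LeRobot, select_action manages this queue internally). A toy sketch of that pattern, not the actual runtime loop:
# Toy illustration of chunked action execution; not lerobot's real control loop.
chunk_size = 50       # --policy.chunk_size: actions predicted per model call
n_action_steps = 50   # --policy.n_action_steps: actions executed before re-querying

total_steps = 200
action_queue: list[str] = []
model_calls = 0

for step in range(total_steps):
    if not action_queue:
        model_calls += 1
        # Pretend the policy predicted a whole chunk; keep only what will be executed.
        chunk = [f"action_{model_calls}_{i}" for i in range(chunk_size)]
        action_queue = chunk[:n_action_steps]
    action = action_queue.pop(0)
    # send `action` to the robot here

print("model calls:", model_calls)  # 200 / 50 = 4 inference calls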
Training Monitoring
W&B is very useful if you like watching curves; if you just want to run the process first, you can skip it.
- Enable W&B (Optional): Add the following to your training command:
  - --wandb.enable true
  - --wandb.project smolvla_experiments
  - (Optional) --wandb.notes "your note"
- Recommended Metrics to Monitor:
- Loss / Action Loss: Whether it's decreasing steadily and not diverging.
- Learning Rate: Whether it's as expected after warmup.
- GPU Memory: Whether it's close to the limit (helps decide whether to lower batch size / enable AMP).
Quick Validation (a faster first check than a full evaluation script)
Often, you don't need to write a full evaluation suite first: running inference on a single real input frame quickly surfaces most of the common pitfalls (field keys, dimensions, dtype, normalization, etc.).
import torch

from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Depending on the lerobot version, the weights may live under
# checkpoints/last/pretrained_model instead of checkpoints/last.
policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last",
)
policy.to("cuda")
policy.eval()

# Recommended: take a real frame from your dataset for the following two tensors.
# image_tensor: [1, 3, H, W] float32 (typically 0~1)
# state_tensor: [1, state_dim] float32
image_tensor = torch.rand(1, 3, 256, 256, device="cuda")  # placeholder; replace with a real frame
state_tensor = torch.zeros(1, 6, device="cuda")           # placeholder; set state_dim to your robot

observation = {
    "observation.images.cam_high": image_tensor,
    "observation.state": state_tensor,
    # SmolVLA is language-conditioned; recent lerobot versions expect the instruction under "task".
    "task": ["your task instruction"],
}

with torch.no_grad():
    action = policy.select_action(observation)
print("action shape:", getattr(action, "shape", None))
If you get a key error or shape error here, check two things first: the dataset field names (e.g., whether cam_high matches) and the state/action dimensions (e.g., a missing or extra gripper dimension, or an inconsistent joint count).
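To use a real frame instead of hand-built tensors, you can pull one sample from the dataset and turn it into the observation dict. A sketch, assuming the same LeRobotDataset API as above and your dataset's own key names; policy is the model loaded in the previous snippet:
from pathlib import Path

import torch
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Reuse one real training frame as the inference input.
dataset = LeRobotDataset(
    "local/mylerobot3",
    root=Path("~/Downloads/mylerobot3").expanduser(),
)
sample = dataset[0]

observation = {
    key: value.unsqueeze(0).to("cuda")  # add the batch dimension the policy expects
    for key, value in sample.items()
    if key.startswith("observation.")
}
observation["task"] = ["your task instruction"]  # language instruction, as above

with torch.no_grad():
    action = policy.select_action(observation)
print("action shape:", action.shape)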
Best Practices (Simplified)
- Data: Fewer but stable and consistent episodes beat many messy ones (same camera naming, same state/action definition).
- Starting Hyperparameters: Start with a configuration that runs (e.g., batch_size=16/32) and confirm the loss looks normal before scaling up.
- Memory Saving Trio: --policy.use_amp true, decrease --batch_size, decrease --num_workers (prioritize stability).
- Checkpoints: Keep checkpoints/last for easier inference and regression testing.
FAQ
- Q: Training runs fine, but inference reports a field key mismatch?
  - A: First align the camera keys (e.g., cam_high) and the observation.state dimensions with the dataset; such issues are usually caused by an inconsistent input mapping rather than a "broken model."
- Q: OOM right at the start?
  - A: Lower --batch_size to 16 or 8, then enable --policy.use_amp true; also lower --num_workers to 2 or 1.
- Q: Loss doesn't decrease much?
  - A: Confirm the data is truly aligned (state/action dimensions, time alignment, action smoothness) before adjusting the learning rate or training steps.
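For the key-mismatch question above, it often helps to print what the policy expects next to what one dataset sample actually provides. A debugging sketch, assuming the policy config exposes input_features (as recent LeRobot releases do), with policy and dataset loaded as in the earlier snippets:
# Compare the policy's expected input features with one dataset sample.
sample = dataset[0]

print("policy expects:")
for name, feature in policy.config.input_features.items():
    print(" ", name, getattr(feature, "shape", None))

print("dataset provides:")
for key, value in sample.items():
    if key.startswith("observation."):
        print(" ", key, tuple(value.shape))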