Fine-tuning SmolVLA Model

SmolVLA (Small Vision-Language-Action) is a lightweight VLA model (approx. 450M parameters) released by Hugging Face. If you want to walk through the full "data → training → inference" loop quickly on a single consumer-grade GPU, SmolVLA is usually the easiest starting point.

This article will guide you through three things:

  • Preparing LeRobot format data (local or Hub)
  • Starting fine-tuning training (single-card/multi-card/memory-optimized)
  • Performing a minimal inference validation with the trained model

Prerequisites

System Requirements

  • OS: Linux (Ubuntu 20.04+ recommended) or macOS.
  • Python Version: 3.10+ recommended (matches common LeRobot dependencies better).
  • GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 8GB VRAM.
  • Memory: At least 16GB RAM.
  • Storage: At least 50GB available space.

Environment Preparation

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment (venv recommended; conda also works)
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

# Install dependencies (SmolVLA-related dependencies are usually in extras)
pip install -e ".[smolvla]"

2. Install Additional Dependencies

# Install Flash Attention (Optional: typically only for Linux + NVIDIA CUDA environments; not applicable for macOS/CPU)
pip install flash-attn --no-build-isolation

# Install Weights & Biases (for experiment tracking; optional)
pip install wandb
Tip: If you don't want to use W&B for now, you can skip wandb login. Training will run fine; just set --wandb.enable false later.
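
After installing, a quick sanity check is to confirm that the lerobot-train entry point resolves (the exact help output varies by version):

# Sanity check: the training CLI should print its help text
lerobot-train --help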

Data Preparation

LeRobot Format Data

SmolVLA uses the LeRobot dataset format. You need at least the following files/directories (names might vary slightly depending on export method, but the core is meta.json + data/):

your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...

Data Quality Requirements

According to Hugging Face recommendations, SmolVLA needs:

  • At least 25 high-quality episodes to achieve good performance.
  • 100+ episodes recommended for optimal results.
  • Each episode should contain a complete task execution process.
  • Image resolution recommended at 224x224 or 256x256.
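
Before training, it's worth loading the dataset once in Python to check the episode count and feature shapes. A minimal sketch follows; the import path and attribute names may differ slightly across LeRobot versions, and the repo ID/root are this article's local-dataset examples:

from pathlib import Path

# Older LeRobot versions use lerobot.common.datasets.lerobot_dataset instead
from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("local/mylerobot3", root=Path("~/Downloads/mylerobot3").expanduser())
print("episodes:", ds.num_episodes)  # aim for 25+, ideally 100+
print("frames:", ds.num_frames)

frame = ds[0]  # one training frame as a dict (mostly tensors)
for key, value in frame.items():
    print(key, getattr(value, "shape", type(value).__name__))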

Fine-tuning Training

Basic Training Command

# Define the dataset "ID" and "path" explicitly to avoid command errors
# - Local dataset: set DATASET_ID to local/xxx and provide DATASET_ROOT
# - Hub dataset: set DATASET_ID to your-name/your-repo and remove --dataset.root
DATASET_ID=local/mylerobot3
DATASET_ROOT=~/Downloads/mylerobot3

export CUDA_VISIBLE_DEVICES=0

# Start SmolVLA fine-tuning
lerobot-train \
  --policy.type smolvla \
  --policy.pretrained_path lerobot/smolvla_base \
  --dataset.repo_id ${DATASET_ID} \
  --dataset.root ${DATASET_ROOT} \
  --batch_size 64 \
  --steps 20000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_finetuning \
  --policy.device cuda \
  --policy.optimizer_lr 1e-4 \
  --policy.scheduler_warmup_steps 1000 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --save_freq 5000 \
  --wandb.enable true \
  --wandb.project smolvla_finetuning
Note: If using a Hub dataset (e.g., your-name/your-repo), change DATASET_ID to that value and remove the --dataset.root line, as sketched below.
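
For example (hypothetical repo name), only these parts of the command change:

# Hub dataset variant: point repo_id at the Hub repo and drop --dataset.root
DATASET_ID=your-name/your-repo

lerobot-train \
  --policy.type smolvla \
  --policy.pretrained_path lerobot/smolvla_base \
  --dataset.repo_id ${DATASET_ID}
  # ...keep the remaining flags from the command above, minus --dataset.root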

Advanced Training Configurations

Multi-GPU Training

# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
  $(which lerobot-train) \
  --policy.type smolvla \
  --policy.pretrained_path lerobot/smolvla_base \
  --dataset.repo_id ${DATASET_ID} \
  --dataset.root ${DATASET_ROOT} \
  --batch_size 32 \
  --steps 20000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_multi_gpu \
  --policy.device cuda \
  --policy.optimizer_lr 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
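
Under standard DDP semantics (each torchrun process runs its own dataloader), --batch_size is per GPU, so the effective batch size here is 2 × 32 = 64, matching the single-GPU command above.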

Memory-Optimized Configuration

# For GPUs with smaller VRAM
lerobot-train \
  --policy.type smolvla \
  --policy.pretrained_path lerobot/smolvla_base \
  --dataset.repo_id ${DATASET_ID} \
  --dataset.root ${DATASET_ROOT} \
  --batch_size 16 \
  --steps 30000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_memory_optimized \
  --policy.device cuda \
  --policy.optimizer_lr 5e-5 \
  --policy.use_amp true \
  --num_workers 2 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.type | Policy type | smolvla | SmolVLA model type |
| --policy.pretrained_path | Pre-trained model path | lerobot/smolvla_base | Official pre-trained model on HuggingFace |
| --dataset.repo_id | Dataset ID | local/mylerobot3 | Use local/xxx for local training; or your-name/your-repo (Hub) |
| --dataset.root | Dataset local path | ~/Downloads/mylerobot3 | Only needed for local datasets (dir containing meta.json, data/, etc.) |
| --batch_size | Batch size | 64 | Adjust based on VRAM; 32-64 recommended for RTX 3080 |
| --steps | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| --output_dir | Output directory | outputs/train/smolvla_finetuned | Model save path |
| --job_name | Job name | smolvla_finetuning | For logs and experiment tracking (optional) |

Training Parameters

| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-4 | Can be lowered somewhat for fine-tuning |
| --policy.scheduler_warmup_steps | Warmup steps | 1000 | LR warmup for stable training |
| --policy.use_amp | Use AMP | true | Saves VRAM and speeds up training |
| --policy.optimizer_grad_clip_norm | Gradient clipping | 1.0 | Prevents gradient explosion |
| --num_workers | Data-loading workers | 4 | Adjust based on CPU cores |
| --policy.push_to_hub | Push to Hub | false | Whether to upload the model to HuggingFace (requires repo_id) |
| --save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
| --save_freq | Save frequency | 5000 | Steps between checkpoint saves |

Model Specific Parameters

| Parameter | Meaning | Recommended | Description |
|---|---|---|---|
| --policy.vlm_model_name | VLM backbone | HuggingFaceTB/SmolVLM2-500M-Video-Instruct | VLM model used by SmolVLA |
| --policy.chunk_size | Action chunk size | 50 | Length of the predicted action sequence |
| --policy.n_action_steps | Execution steps | 50 | Number of actions executed per prediction |
| --policy.n_obs_steps | Observation history steps | 1 | Number of past observation frames used |
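
To make the chunk_size / n_action_steps relationship concrete, here is a self-contained toy loop (an illustration only, not LeRobot internals): the policy predicts a chunk of actions, and a new forward pass is needed only after n_action_steps of them have been executed.

import torch

CHUNK_SIZE = 50       # --policy.chunk_size: actions predicted per forward pass
N_ACTION_STEPS = 50   # --policy.n_action_steps: actions executed before re-planning
ACTION_DIM = 6        # hypothetical action dimension

forward_passes = 0
queue: list[torch.Tensor] = []  # pending actions from the last predicted chunk

def predict_chunk() -> torch.Tensor:
    # Stand-in for the model forward pass; returns [CHUNK_SIZE, ACTION_DIM].
    global forward_passes
    forward_passes += 1
    return torch.zeros(CHUNK_SIZE, ACTION_DIM)

for step in range(200):  # 200 control steps
    if not queue:
        queue = list(predict_chunk()[:N_ACTION_STEPS])
    action = queue.pop(0)  # execute one action per control step

print("forward passes for 200 steps:", forward_passes)  # 200 / 50 = 4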

Training Monitoring

W&B is very useful if you like watching curves; if you just want to get the pipeline running first, you can skip it.

  • Enable W&B (Optional): Add the following to your training command:
    • --wandb.enable true
    • --wandb.project smolvla_experiments
    • (Optional) --wandb.notes "your note"
  • Recommended Metrics to Monitor:
    • Loss / Action Loss: Whether it's decreasing steadily and not diverging.
    • Learning Rate: Whether it's as expected after warmup.
    • GPU Memory: Whether it's close to the limit (helps decide whether to lower batch size / enable AMP).

Quick Validation (More reliable than writing evaluation scripts)

Often you don't need to write a full evaluation suite first: running inference on a single frame of real input catches most of the common pitfalls early (field keys, dimensions, dtype, normalization, etc.).

import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last",
    device="cuda",
)
policy.eval()

# Take a "real frame" from your dataset for the following two tensors:
# image_tensor: [1, 3, H, W] float32 (typically 0~1)
# state_tensor: [1, state_dim] float32
observation = {
    "observation.images.cam_high": image_tensor,
    "observation.state": state_tensor,
    # SmolVLA is language-conditioned; recent LeRobot versions also expect the
    # task instruction under the "task" key (hypothetical instruction below):
    "task": "pick up the cube",
}

with torch.no_grad():
    action = policy.select_action(observation)

print("action shape:", getattr(action, "shape", None))
Tip: If you get a key error or shape error here, check two things first: the dataset field names (e.g., whether cam_high matches your camera key) and the state/action dimensions (e.g., a missing or extra gripper dimension, or an inconsistent joint count).
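
If you're unsure how to produce a "real frame", you can pull one straight from the training dataset. The same caveats apply as in the dataset sketch above: the import path may vary by version, and the repo ID/root are this article's local-dataset examples.

from pathlib import Path

from lerobot.datasets.lerobot_dataset import LeRobotDataset

ds = LeRobotDataset("local/mylerobot3", root=Path("~/Downloads/mylerobot3").expanduser())
frame = ds[0]

image_tensor = frame["observation.images.cam_high"].unsqueeze(0).to("cuda")  # [1, 3, H, W]
state_tensor = frame["observation.state"].unsqueeze(0).to("cuda")            # [1, state_dim]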

Best Practices (Simplified)

  • Data: Fewer but "stable and consistent" episodes beat a larger messy set (same camera naming / same state/action definitions).
  • Starting Hyperparameters: Begin with a configuration that runs (e.g., batch_size=16/32) and confirm the loss looks normal before scaling up.
  • Memory-Saving Trio: --policy.use_amp true, lower --batch_size, lower --num_workers (prioritize stability).
  • Checkpoints: Keep checkpoints/last for easier inference and regression testing.

FAQ

  • Q: Training runs fine, but inference reports field key mismatch?
    • A: First align the camera keys (e.g., cam_high) and the observation.state dimension with the dataset; such errors usually come from inconsistent input mapping rather than a "broken model."
  • Q: OOM right at the start?
    • A: Lower --batch_size to 16/8, then enable --policy.use_amp true; also lower --num_workers to 2 or 1.
  • Q: Loss doesn't decrease much?
    • A: Confirm if data is "truly aligned" (state/action dimensions, time alignment, action smoothness) before considering LR or training step adjustments.