SmolVLA Model Fine-tuning

Overview

SmolVLA (Small Vision-Language-Action) is a lightweight vision-language-action model from HuggingFace, designed specifically for robot learning tasks. At only about 450M parameters, it runs on consumer-grade hardware, making it a practical choice for robot learning research and development.

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
  • Python Version: 3.10+ (the environment created below uses 3.10)
  • GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 8GB VRAM
  • Memory: At least 16GB RAM
  • Storage: At least 50GB available space
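
You can confirm the GPU model and available VRAM from the command line:

# Check GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv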

Environment Setup

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot

# Install dependencies
pip install -e .
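
A quick way to verify the install is that the package imports cleanly:

# The import should succeed without errors
python -c "import lerobot; print('lerobot OK')"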

2. Install Additional Dependencies

# Install Flash Attention (optional, for training acceleration)
pip install flash-attn --no-build-isolation

# Install Weights & Biases (for experiment tracking)
pip install wandb
wandb login

Data Preparation

LeRobot Format Data

SmolVLA requires datasets in the LeRobot format. For current LeRobot dataset versions (v2.x), the on-disk layout looks roughly like this (exact file names vary between dataset versions):

your_dataset/
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   └── ...
│   └── chunk-001/
│       └── ...
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── episodes.jsonl
│   └── tasks.jsonl
└── videos/
    └── chunk-000/
        └── observation.images.cam_high/
            ├── episode_000000.mp4
            └── ...

Data Quality Requirements

According to HuggingFace recommendations, SmolVLA requires:

  • Minimum 25 high-quality episodes to achieve good performance
  • 100+ episodes recommended for optimal results
  • Each episode should contain a complete task execution process
  • Image resolution recommended at 224x224 or 256x256
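
Before launching a run, you can check a dataset against these recommendations by loading it and printing its summary attributes. A minimal sketch (the repo_id and root values are placeholders; attribute names follow current LeRobot releases):

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Placeholders: point these at your own dataset
dataset = LeRobotDataset("io-ai-data/my_dataset", root="/data/lerobot_dataset")

print(f"episodes: {dataset.num_episodes}")   # aim for 25+, ideally 100+
print(f"frames:   {dataset.num_frames}")
print(f"fps:      {dataset.fps}")
print(f"cameras:  {dataset.meta.camera_keys}")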

Fine-tuning Training

Basic Training Command

# Set environment variables
export HF_USER="io-ai-data"
export CUDA_VISIBLE_DEVICES=0

# Start SmolVLA fine-tuning
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/lerobot_dataset \
--batch_size 64 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_finetuning \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.scheduler_warmup_steps 1000 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 5000 \
--wandb.enable true \
--wandb.project smolvla_finetuning
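
If a run is interrupted, recent LeRobot versions can resume from the training config saved alongside a checkpoint. Flag names can differ between versions, so treat this as a sketch:

# Resume an interrupted run from the last checkpoint's saved config
lerobot-train \
--config_path outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model/train_config.json \
--resume true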

Advanced Training Configuration

Multi-GPU Training

# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
$(which lerobot-train) \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/my_dataset \
--batch_size 32 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_multi_gpu \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Memory Optimization Configuration

# For GPUs with smaller VRAM
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--batch_size 16 \
--steps 30000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_memory_optimized \
--policy.device cuda \
--policy.optimizer_lr 5e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.type` | Policy type | `smolvla` | SmolVLA model type |
| `--policy.pretrained_path` | Pretrained model path | `lerobot/smolvla_base` | Official pretrained model on HuggingFace |
| `--dataset.repo_id` | Dataset repository ID | `${HF_USER}/my_dataset` | Your HuggingFace dataset |
| `--dataset.root` | Dataset storage location | `/data/my_dataset` | Read from a local directory (optional) |
| `--batch_size` | Batch size | 64 | Adjust to VRAM; 32-64 recommended on an RTX 3080 |
| `--steps` | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| `--output_dir` | Output directory | `outputs/train/smolvla_finetuned` | Model save path |
| `--job_name` | Job name | `smolvla_finetuning` | For logging and experiment tracking (optional) |

Training Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.optimizer_lr` | Learning rate | 1e-4 | Can be reduced somewhat for fine-tuning |
| `--policy.scheduler_warmup_steps` | Warmup steps | 1000 | Learning-rate warmup stabilizes early training |
| `--policy.use_amp` | Mixed precision | true | Saves VRAM and speeds up training |
| `--policy.optimizer_grad_clip_norm` | Gradient clipping | 1.0 | Prevents exploding gradients |
| `--num_workers` | Data-loading workers | 4 | Adjust to CPU core count |
| `--policy.push_to_hub` | Push to Hub | false | Whether to upload the model to HuggingFace (requires a repo_id) |
| `--save_checkpoint` | Save checkpoints | true | Whether to save training checkpoints |
| `--save_freq` | Save frequency | 5000 | Steps between checkpoint saves |

Model-Specific Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.vlm_model_name` | VLM backbone | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` | Vision-language model used by SmolVLA |
| `--policy.chunk_size` | Action chunk size | 50 | Length of the predicted action sequence |
| `--policy.n_action_steps` | Executed action steps | 50 | Number of actions executed per predicted chunk (see the sketch below) |
| `--policy.n_obs_steps` | Observation history steps | 1 | Number of past observation frames used |
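
How chunk_size and n_action_steps interact is easiest to see in a control loop: select_action() returns one action per call, and the policy predicts a fresh chunk only once its internal queue of n_action_steps actions has been consumed. A minimal sketch (get_observation and send_to_robot are hypothetical helpers; the policy is loaded as in the evaluation section below):

import torch

policy.reset()  # clear the internal action queue at the start of an episode
for step in range(500):
    observation = get_observation()  # hypothetical: returns a batch dict of camera/state tensors
    with torch.no_grad():
        # One action per call; a new 50-step chunk is predicted only after
        # the queued n_action_steps actions have been consumed.
        action = policy.select_action(observation)
    send_to_robot(action)  # hypothetical: forwards the action to your robot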

Training Monitoring

Weights & Biases Integration

SmolVLA supports W&B for experiment tracking:

# Enable W&B logging (other parameters as in the training commands above)
lerobot-train \
--policy.type smolvla \
--dataset.repo_id your-name/your-repo \
--batch_size 64 \
--steps 20000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project smolvla_experiments \
--wandb.notes "SmolVLA finetuning on custom dataset"

Key Metrics Monitoring

Metrics worth monitoring during training (exact names depend on the LeRobot version):

  • Loss: Overall loss, should steadily decrease
  • Action Loss: Action prediction loss
  • Vision Loss: Visual feature loss
  • Language Loss: Language understanding loss
  • Learning Rate: Learning rate changes
  • GPU Memory: VRAM usage

Model Evaluation

Saving and Loading Models

# Load the fine-tuned model
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
)
policy.to("cuda")
policy.eval()

# Run inference (image_tensor and state_tensor are placeholders for your data)
observation = {
    "observation.images.cam_high": image_tensor,
    "observation.state": state_tensor,
}

with torch.no_grad():
    action = policy.select_action(observation)

Performance Evaluation Script

# evaluation.py
import torch
from torch.utils.data import DataLoader
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_model(model_path, repo_id, root=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the fine-tuned policy
    policy = SmolVLAPolicy.from_pretrained(model_path)
    policy.to(device)
    policy.eval()

    # Load the dataset to evaluate on (ideally episodes held out from training).
    # Note: computing the chunked action loss may require constructing the
    # dataset with the same delta_timestamps used during training.
    dataset = LeRobotDataset(repo_id, root=root)
    loader = DataLoader(dataset, batch_size=8, num_workers=2)

    total_loss = 0.0
    num_batches = 0

    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                     for k, v in batch.items()}
            # LeRobot policies return (loss, loss_dict) from forward()
            loss, _ = policy.forward(batch)
            total_loss += loss.item()
            num_batches += 1

    avg_loss = total_loss / num_batches
    print(f"Average test loss: {avg_loss:.4f}")
    return avg_loss

if __name__ == "__main__":
    model_path = "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
    evaluate_model(model_path, repo_id="io-ai-data/my_dataset", root="/data/lerobot_dataset")

Deployment and Inference

Real-time Inference Example

# inference.py
import torch
import numpy as np
from PIL import Image
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

class SmolVLAInference:
    def __init__(self, model_path, device="cuda"):
        self.device = device
        self.policy = SmolVLAPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()

    def predict_action(self, image, state, instruction=""):
        # Preprocess the image
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)

        # Build the observation batch (batch dimension of 1)
        observation = {
            "observation.images.cam_high": self.preprocess_image(image).to(self.device),
            "observation.state": torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device),
            "task": instruction,  # language-instruction key used in LeRobot batches
        }

        # Predict the next action
        with torch.no_grad():
            action = self.policy.select_action(observation)

        return action.squeeze(0).cpu().numpy()  # drop the batch dimension

    def preprocess_image(self, image):
        # Resize and convert to a (1, C, H, W) float tensor in [0, 1]
        image = image.resize((224, 224))
        image_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).float() / 255.0
        return image_tensor.unsqueeze(0)

# Usage example
if __name__ == "__main__":
    inference = SmolVLAInference("outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model")

    # Simulated inputs
    image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    state = np.random.randn(7)  # e.g., a 7-DoF robot state
    instruction = "pick up the red cube"

    action = inference.predict_action(image, state, instruction)
    print(f"Predicted action: {action}")

Best Practices

Data Preparation Recommendations

  1. Data Quality: Ensure quality of demonstration data, avoid incorrect or inconsistent actions
  2. Data Diversity: Include data from different scenarios, lighting conditions, and object positions
  3. Task Descriptions: Provide clear natural language descriptions for each episode
  4. Data Balance: Ensure balance between success and failure cases

Training Optimization Recommendations

  1. Learning Rate Scheduling: Use learning rate warmup and decay strategies
  2. Regularization: Appropriately use dropout and weight decay
  3. Checkpoint Saving: Regularly save model checkpoints
  4. Early Stopping: Monitor validation loss to avoid overfitting

Hardware Optimization Recommendations

  1. VRAM Management: Use mixed precision training to save VRAM
  2. Batch Size: Adjust batch size based on VRAM capacity
  3. Data Loading: Use multi-process data loading to accelerate training
  4. Model Parallelism: For large models, consider using model parallelism

Frequently Asked Questions (FAQ)

Q: What advantages does SmolVLA have compared to other VLA models?

A: Main advantages of SmolVLA include:

  • Lightweight: Only 450M parameters, suitable for consumer-grade hardware
  • Efficient Training: Relatively short training time
  • Good Performance: Excellent performance on multiple robot tasks
  • Easy Deployment: Moderate model size, convenient for actual deployment

Q: How long does training take?

A: Training time depends on multiple factors:

  • Dataset size: 100 episodes take approximately 2-4 hours (RTX 3080)
  • Batch size: Larger batches can accelerate training
  • Hardware configuration: Better GPUs can significantly reduce training time
  • Training steps: 20000 steps are usually sufficient for good results

Q: How to determine if the model has converged?

A: Observe the following metrics:

  • Loss curves: Overall loss should steadily decrease and plateau
  • Validation performance: Performance on validation set no longer improves
  • Action predictions: Model-predicted actions should be reasonable
  • Actual testing: Test model performance in real environment

Q: What to do if VRAM is insufficient?

A: You can try the following methods:

  • Reduce batch size (e.g., from 64 to 32 or 16): --batch_size 16
  • Enable mixed precision training: --policy.use_amp true
  • Reduce data loading threads: --num_workers 2
  • Use smaller image resolution: --policy.resize_imgs_with_padding 224 224
  • Reduce observation steps: --policy.n_obs_steps 1

Q: How to improve model performance?

A: Methods to improve performance:

  • Increase data volume: Collect more high-quality demonstration data
  • Data augmentation: Use image augmentation techniques to increase data diversity
  • Hyperparameter tuning: Adjust learning rate, batch size and other parameters
  • Model ensembling: Train multiple models and ensemble them
  • Domain adaptation: Additional fine-tuning for specific tasks

Changelog

  • 2024-01: Initial version release
  • 2024-02: Added multi-GPU training support
  • 2024-03: Optimized memory usage and training efficiency
  • 2024-04: Added more evaluation and deployment examples