SmolVLA Model Fine-tuning
Overview
SmolVLA (Small Vision-Language-Action) is a lightweight vision-language-action model from Hugging Face, designed specifically for robot learning. At roughly 450M parameters it runs on consumer-grade hardware, making it a practical choice for robot learning research and development.
Prerequisites
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
- Python Version: 3.10+ (LeRobot requires Python 3.10 or newer)
- GPU: NVIDIA GPU with at least 8GB VRAM (RTX 3080 or higher recommended)
- Memory: At least 16GB RAM
- Storage: At least 50GB available space
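Before setting anything up, it can help to confirm the GPU meets these requirements. A minimal check with PyTorch (assuming PyTorch is already installed; the 8GB threshold mirrors the list above):
import torch

# Confirm a CUDA GPU is visible and report its total VRAM
assert torch.cuda.is_available(), "No CUDA-capable GPU detected"
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / 1024**3
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")
if vram_gb < 8:
    print("Warning: under 8GB VRAM; see the memory-optimized configuration below")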
Environment Setup
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot
# Install dependencies
pip install -e .
2. Install Additional Dependencies
# Install Flash Attention (optional, for training acceleration)
pip install flash-attn --no-build-isolation
# Install Weights & Biases (for experiment tracking)
pip install wandb
wandb login
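After installation, a quick import check confirms the environment is usable (run it inside the lerobot conda environment):
# Verify that LeRobot and PyTorch import and that CUDA is visible
import lerobot
import torch

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")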
Data Preparation
LeRobot Format Data
SmolVLA requires datasets in LeRobot format. A typical LeRobot v2.x dataset is laid out roughly as follows (details vary slightly between LeRobot versions):
your_dataset/
├── data/
│   └── chunk-000/
│       ├── episode_000000.parquet
│       ├── episode_000001.parquet
│       └── ...
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── episodes.jsonl
│   └── tasks.jsonl
└── videos/
    └── chunk-000/
        └── observation.images.cam_high/
            ├── episode_000000.mp4
            └── ...
Data Quality Requirements
According to HuggingFace recommendations, SmolVLA requires:
- Minimum 25 high-quality episodes to achieve good performance
- 100+ episodes recommended for optimal results
- Each episode should contain one complete execution of the task
- Image resolution of 224x224 or 256x256 recommended
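Before launching training, it is worth sanity-checking episode counts and recorded features. A minimal sketch using LeRobotDataset (the repo_id is a placeholder; pass root= to read from a local directory):
# Inspect a LeRobot-format dataset before training
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-name/my_dataset")  # or add root="/data/lerobot_dataset"
print(f"episodes: {dataset.num_episodes}")
print(f"frames:   {dataset.num_frames}")
print(f"fps:      {dataset.fps}")
print(f"features: {list(dataset.features)}")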
Fine-tuning Training
Basic Training Command
# Set environment variables
export HF_USER="io-ai-data"
export CUDA_VISIBLE_DEVICES=0
# Start SmolVLA fine-tuning
lerobot-train \
  --policy.path lerobot/smolvla_base \
  --dataset.repo_id ${HF_USER}/my_dataset \
  --dataset.root /data/lerobot_dataset \
  --batch_size 64 \
  --steps 20000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_finetuning \
  --policy.device cuda \
  --policy.optimizer_lr 1e-4 \
  --policy.scheduler_warmup_steps 1000 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --save_freq 5000 \
  --wandb.enable true \
  --wandb.project smolvla_finetuning
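A quick way to relate --steps to dataset size: each optimization step consumes batch_size frames, so the number of passes over the data is roughly steps × batch_size / total_frames. A back-of-the-envelope sketch (the frame count is a placeholder for your dataset's):
# Rough estimate of how many epochs a training run corresponds to
steps = 20_000
batch_size = 64
total_frames = 50_000  # replace with your dataset's total frame count
print(f"~{steps * batch_size / total_frames:.0f} epochs")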
Advanced Training Configuration
Multi-GPU Training
# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
  $(which lerobot-train) \
  --policy.path lerobot/smolvla_base \
  --dataset.repo_id ${HF_USER}/my_dataset \
  --dataset.root /data/my_dataset \
  --batch_size 32 \
  --steps 20000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_multi_gpu \
  --policy.device cuda \
  --policy.optimizer_lr 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
Memory Optimization Configuration
# For GPUs with smaller VRAM
lerobot-train \
  --policy.path lerobot/smolvla_base \
  --dataset.repo_id ${HF_USER}/my_dataset \
  --batch_size 16 \
  --steps 30000 \
  --output_dir outputs/train/smolvla_finetuned \
  --job_name smolvla_memory_optimized \
  --policy.device cuda \
  --policy.optimizer_lr 5e-5 \
  --policy.use_amp true \
  --num_workers 2 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
Parameter Details
Core Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.path | Pretrained model to fine-tune | lerobot/smolvla_base | Official pretrained SmolVLA model on HuggingFace |
| --dataset.repo_id | Dataset repository ID | ${HF_USER}/my_dataset | Your HuggingFace dataset |
| --dataset.root | Dataset storage location | /data/my_dataset | Read the dataset from a local directory (optional) |
| --batch_size | Batch size | 64 | Adjust to available VRAM; 32-64 recommended on an RTX 3080 |
| --steps | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| --output_dir | Output directory | outputs/train/smolvla_finetuned | Model save path |
| --job_name | Job name | smolvla_finetuning | For logging and experiment tracking (optional) |
Training Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-4 | Can be lowered somewhat for fine-tuning |
| --policy.scheduler_warmup_steps | Warmup steps | 1000 | Learning-rate warmup stabilizes early training |
| --policy.use_amp | Mixed precision | true | Saves VRAM and accelerates training |
| --policy.optimizer_grad_clip_norm | Gradient clipping | 1.0 | Prevents exploding gradients |
| --num_workers | Data-loading workers | 4 | Adjust to CPU core count |
| --policy.push_to_hub | Push to Hub | false | Whether to upload the model to HuggingFace (requires repo_id) |
| --save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
| --save_freq | Save frequency | 5000 | Checkpoint save interval in steps |
Model-Specific Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.vlm_model_name | VLM backbone | HuggingFaceTB/SmolVLM2-500M-Video-Instruct | Vision-language model used by SmolVLA |
| --policy.chunk_size | Action chunk size | 50 | Length of the predicted action sequence |
| --policy.n_action_steps | Executed action steps | 50 | Number of actions actually executed per prediction |
| --policy.n_obs_steps | Observation history steps | 1 | Number of past observation frames used |
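The same model-specific options can also be set from Python when constructing the policy config. A hedged sketch (field names follow the table above; verify them against the SmolVLAConfig shipped with your lerobot version):
from lerobot.policies.smolvla.configuration_smolvla import SmolVLAConfig

config = SmolVLAConfig(
    vlm_model_name="HuggingFaceTB/SmolVLM2-500M-Video-Instruct",  # VLM backbone
    chunk_size=50,      # length of each predicted action chunk
    n_action_steps=50,  # actions executed per prediction
    n_obs_steps=1,      # observation history frames
)
print(config)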
Training Monitoring
Weights & Biases Integration
SmolVLA supports W&B for experiment tracking:
# Enable W&B logging
lerobot-train \
  --policy.path lerobot/smolvla_base \
  --dataset.repo_id your-name/your-repo \
  --batch_size 64 \
  --steps 20000 \
  --policy.push_to_hub false \
  --wandb.enable true \
  --wandb.project smolvla_experiments \
  --wandb.notes "SmolVLA finetuning on custom dataset"
  # ... other parameters as in the basic command above
Key Metrics Monitoring
Metrics to monitor during training:
- Loss: SmolVLA's training objective is the action expert's flow-matching loss; it should decrease steadily, then plateau
- Learning Rate: should follow the configured warmup and decay schedule
- Gradient Norm: sustained spikes can indicate training instability
- GPU Memory: VRAM usage
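Logged metrics can also be pulled programmatically with the W&B public API, for example to compare runs offline. A sketch (the run path and metric key are placeholders; check the exact names in your W&B dashboard):
# Fetch logged metrics for a training run via the W&B public API
import wandb

api = wandb.Api()
run = api.run("your-entity/smolvla_finetuning/<run-id>")  # placeholder run path
history = run.history(keys=["loss"])  # metric key as it appears in your dashboard
print(history.tail())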
Model Evaluation
Saving and Loading Models
# Load fine-tuned model
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

# Checkpoints are saved under checkpoints/<step>/pretrained_model,
# with "last" pointing to the most recent step
policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
)
policy.to("cuda")
policy.eval()
# Perform inference (tensors must be batched and on the policy's device)
observation = {
    "observation.images.cam_high": image_tensor.to("cuda"),
    "observation.state": state_tensor.to("cuda"),
    "task": "pick up the red cube",  # language instruction
}
action = policy.select_action(observation)
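Chunked policies such as SmolVLA keep an internal action queue: select_action predicts a chunk of actions and then returns them one per call. The queue should be cleared at every episode boundary:
# Discard stale queued actions at the start of each new episode
policy.reset()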
Performance Evaluation Script
# evaluation.py
import torch
from torch.utils.data import DataLoader
from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

def evaluate_model(model_path, repo_id):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load model
    policy = SmolVLAPolicy.from_pretrained(model_path)
    policy.to(device)
    policy.eval()

    # Load dataset. LeRobot datasets are addressed by repo_id (pass root= for a
    # local copy) and have no built-in test split, so point this at episodes
    # held out from training. NOTE: for chunked action losses the dataset must
    # be built with the same delta_timestamps used during training.
    dataset = LeRobotDataset(repo_id)
    loader = DataLoader(dataset, batch_size=8, shuffle=False)

    total_loss = 0.0
    num_batches = 0

    with torch.no_grad():
        for batch in loader:
            # Move tensors to the policy's device; leave strings (e.g. "task") as-is
            batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                     for k, v in batch.items()}
            loss, _ = policy.forward(batch)  # forward returns (loss, metrics dict)
            total_loss += loss.item()
            num_batches += 1

    avg_loss = total_loss / num_batches
    print(f"Average held-out loss: {avg_loss:.4f}")
    return avg_loss

if __name__ == "__main__":
    model_path = "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
    repo_id = "your-name/your_test_dataset"
    evaluate_model(model_path, repo_id)
Deployment and Inference
Real-time Inference Example
# inference.py
import torch
import numpy as np
from PIL import Image
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
class SmolVLAInference:
    def __init__(self, model_path, device="cuda"):
        self.device = device
        self.policy = SmolVLAPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()

    def predict_action(self, image, state, instruction=""):
        # Preprocess image
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)

        # Build observation (batched tensors, on the policy's device)
        observation = {
            "observation.images.cam_high": self.preprocess_image(image).to(self.device),
            "observation.state": torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device),
            "task": instruction,  # language-instruction key in LeRobot batches
        }

        # Predict action
        with torch.no_grad():
            action = self.policy.select_action(observation)

        return action.cpu().numpy()

    def preprocess_image(self, image):
        # Resize, then convert HxWxC uint8 to a 1xCxHxW float tensor in [0, 1]
        image = image.resize((224, 224))
        image_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).float() / 255.0
        return image_tensor.unsqueeze(0)
# Usage example
if __name__ == "__main__":
    inference = SmolVLAInference("outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model")
    
    # Simulate input
    image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    state = np.random.randn(7)  # 7-DOF robot state
    instruction = "pick up the red cube"
    
    action = inference.predict_action(image, state, instruction)
    print(f"Predicted action: {action}")
Best Practices
Data Preparation Recommendations
- Data Quality: Ensure demonstrations are correct and consistent; avoid erroneous or contradictory actions
- Data Diversity: Include data from different scenes, lighting conditions, and object positions
- Task Descriptions: Provide a clear natural language description for each episode
- Data Consistency: Imitation learning reproduces what it sees, so prefer successful demonstrations and trim failed or idle segments
Training Optimization Recommendations
- Learning Rate Scheduling: Use learning rate warmup followed by decay
- Regularization: Apply dropout and weight decay in moderation
- Checkpoint Saving: Save model checkpoints regularly (--save_freq)
- Early Stopping: Monitor validation loss to avoid overfitting
Hardware Optimization Recommendations
- VRAM Management: Use mixed precision training (--policy.use_amp true) to save VRAM
- Batch Size: Adjust batch size to available VRAM
- Data Loading: Use multiple dataloader workers (--num_workers) to keep the GPU fed
- Model Parallelism: For large models, consider model parallelism
Frequently Asked Questions (FAQ)
Q: What advantages does SmolVLA have compared to other VLA models?
A: Main advantages of SmolVLA include:
- Lightweight: Only 450M parameters, suitable for consumer-grade hardware
- Efficient Training: Relatively short training time
- Good Performance: Strong results across multiple robot tasks
- Easy Deployment: Moderate model size makes real-world deployment practical
Q: How long does training take?
A: Training time depends on multiple factors:
- Dataset size: 100 episodes take approximately 2-4 hours on an RTX 3080
- Batch size: Larger batches can speed up training
- Hardware configuration: Faster GPUs significantly reduce training time
- Training steps: 20000 steps are usually sufficient for good results
Q: How to determine if the model has converged?
A: Observe the following metrics:
- Loss curves: Overall loss should decrease steadily and then plateau
- Validation performance: Performance on a validation set no longer improves
- Action predictions: Predicted actions should look reasonable
- Real-world testing: Test the model's performance in the target environment
Q: What to do if VRAM is insufficient?
A: You can try the following methods:
- Reduce the batch size (e.g., from 64 to 32 or 16): --batch_size 16
- Enable mixed precision training: --policy.use_amp true
- Reduce data-loading workers: --num_workers 2
- Use a smaller image resolution: --policy.resize_imgs_with_padding 224 224
- Reduce observation history: --policy.n_obs_steps 1
Q: How to improve model performance?
A: Methods to improve performance:
- Increase data volume: Collect more high-quality demonstrations
- Data augmentation: Use image augmentation to increase data diversity
- Hyperparameter tuning: Adjust learning rate, batch size, and other parameters
- Model ensembling: Train multiple models and ensemble them
- Domain adaptation: Additional fine-tuning for the specific target task
Related Resources
- SmolVLA Official Blog: https://huggingface.co/blog/smolvla
- LeRobot Official Documentation: https://huggingface.co/docs/lerobot
- SmolVLA Model Page: https://huggingface.co/lerobot/smolvla_base
- LeRobot GitHub Repository: https://github.com/huggingface/lerobot
- Robot Learning Papers Collection
Changelog
- 2024-01: Initial version release
- 2024-02: Added multi-GPU training support
- 2024-03: Optimized memory usage and training efficiency
- 2024-04: Added more evaluation and deployment examples