SmolVLA Model Fine-tuning

Overview

SmolVLA (Small Vision-Language-Action) is a lightweight vision-language-action model from HuggingFace, designed specifically for robot learning tasks. At only about 450M parameters, it runs on consumer-grade hardware, making it a practical choice for robot learning research and development.

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
  • Python Version: 3.10+ (the environment created below uses 3.10)
  • GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 8GB VRAM
  • Memory: At least 16GB RAM
  • Storage: At least 50GB available space
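
You can confirm the GPU model and available VRAM from the command line:

# Check GPU name and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv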

Environment Setup

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot

# Install dependencies
pip install -e .
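
A quick way to verify the install is that the package imports cleanly:

# The import should succeed without errors
python -c "import lerobot; print('lerobot OK')"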

2. Install Additional Dependencies

# Install Flash Attention (optional, for training acceleration)
pip install flash-attn --no-build-isolation

# Install Weights & Biases (for experiment tracking)
pip install wandb
wandb login

Data Preparation

LeRobot Format Data

SmolVLA requires datasets in the LeRobot format. For current LeRobot dataset versions (v2.x), the on-disk layout looks roughly like this (exact file names vary between dataset versions):

your_dataset/
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   └── ...
│   └── chunk-001/
│       └── ...
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── episodes.jsonl
│   └── tasks.jsonl
└── videos/
    └── chunk-000/
        └── observation.images.cam_high/
            ├── episode_000000.mp4
            └── ...

Data Quality Requirements

According to HuggingFace recommendations, SmolVLA requires:

  • Minimum 25 high-quality episodes to achieve good performance
  • 100+ episodes recommended for optimal results
  • Each episode should contain a complete task execution process
  • Image resolution recommended at 224x224 or 256x256
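
Before launching a run, you can check a dataset against these recommendations by loading it and printing its summary attributes. A minimal sketch (the repo_id and root values are placeholders; attribute names follow current LeRobot releases):

from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Placeholders: point these at your own dataset
dataset = LeRobotDataset("io-ai-data/my_dataset", root="/data/lerobot_dataset")

print(f"episodes: {dataset.num_episodes}")   # aim for 25+, ideally 100+
print(f"frames:   {dataset.num_frames}")
print(f"fps:      {dataset.fps}")
print(f"cameras:  {dataset.meta.camera_keys}")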

Fine-tuning Training

Basic Training Command

# Set environment variables
export HF_USER="io-ai-data"
export CUDA_VISIBLE_DEVICES=0

# Start SmolVLA fine-tuning
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/lerobot_dataset \
--batch_size 64 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_finetuning \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.scheduler_warmup_steps 1000 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 5000 \
--wandb.enable true \
--wandb.project smolvla_finetuning
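
If a run is interrupted, recent LeRobot versions can resume from the training config saved alongside a checkpoint. Flag names can differ between versions, so treat this as a sketch:

# Resume an interrupted run from the last checkpoint's saved config
lerobot-train \
--config_path outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model/train_config.json \
--resume true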

Advanced Training Configuration

Multi-GPU Training

# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
$(which lerobot-train) \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/my_dataset \
--batch_size 32 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_multi_gpu \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Memory Optimization Configuration

# For GPUs with smaller VRAM
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--batch_size 16 \
--steps 30000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_memory_optimized \
--policy.device cuda \
--policy.optimizer_lr 5e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.type` | Policy type | `smolvla` | SmolVLA model type |
| `--policy.pretrained_path` | Pretrained model path | `lerobot/smolvla_base` | Official pretrained model on HuggingFace |
| `--dataset.repo_id` | Dataset repository ID | `${HF_USER}/my_dataset` | Your HuggingFace dataset |
| `--dataset.root` | Dataset storage location | `/data/my_dataset` | Read from a local directory (optional) |
| `--batch_size` | Batch size | 64 | Adjust to VRAM; 32-64 recommended on an RTX 3080 |
| `--steps` | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| `--output_dir` | Output directory | `outputs/train/smolvla_finetuned` | Model save path |
| `--job_name` | Job name | `smolvla_finetuning` | For logging and experiment tracking (optional) |

Training Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.optimizer_lr` | Learning rate | 1e-4 | Can be reduced somewhat for fine-tuning |
| `--policy.scheduler_warmup_steps` | Warmup steps | 1000 | Learning-rate warmup stabilizes early training |
| `--policy.use_amp` | Mixed precision | true | Saves VRAM and speeds up training |
| `--policy.optimizer_grad_clip_norm` | Gradient clipping | 1.0 | Prevents exploding gradients |
| `--num_workers` | Data-loading workers | 4 | Adjust to CPU core count |
| `--policy.push_to_hub` | Push to Hub | false | Whether to upload the model to HuggingFace (requires a repo_id) |
| `--save_checkpoint` | Save checkpoints | true | Whether to save training checkpoints |
| `--save_freq` | Save frequency | 5000 | Steps between checkpoint saves |

Model-Specific Parameters

| Parameter | Meaning | Recommended Value | Description |
|-----------|---------|-------------------|-------------|
| `--policy.vlm_model_name` | VLM backbone | `HuggingFaceTB/SmolVLM2-500M-Video-Instruct` | Vision-language model used by SmolVLA |
| `--policy.chunk_size` | Action chunk size | 50 | Length of the predicted action sequence |
| `--policy.n_action_steps` | Executed action steps | 50 | Number of actions executed per predicted chunk (see the sketch below) |
| `--policy.n_obs_steps` | Observation history steps | 1 | Number of past observation frames used |
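
How chunk_size and n_action_steps interact is easiest to see in a control loop: select_action() returns one action per call, and the policy predicts a fresh chunk only once its internal queue of n_action_steps actions has been consumed. A minimal sketch (get_observation and send_to_robot are hypothetical helpers; the policy is loaded as in the evaluation section below):

import torch

policy.reset()  # clear the internal action queue at the start of an episode
for step in range(500):
    observation = get_observation()  # hypothetical: returns a batch dict of camera/state tensors
    with torch.no_grad():
        # One action per call; a new 50-step chunk is predicted only after
        # the queued n_action_steps actions have been consumed.
        action = policy.select_action(observation)
    send_to_robot(action)  # hypothetical: forwards the action to your robot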

Training Monitoring

Weights & Biases Integration

SmolVLA supports W&B for experiment tracking:

# Enable W&B logging (other parameters as in the training commands above)
lerobot-train \
--policy.type smolvla \
--dataset.repo_id your-name/your-repo \
--batch_size 64 \
--steps 20000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project smolvla_experiments \
--wandb.notes "SmolVLA finetuning on custom dataset"

Key Metrics Monitoring

Metrics worth monitoring during training (exact names depend on the LeRobot version):

  • Loss: Overall loss, should steadily decrease
  • Action Loss: Action prediction loss
  • Vision Loss: Visual feature loss
  • Language Loss: Language understanding loss
  • Learning Rate: Learning rate changes
  • GPU Memory: VRAM usage

Model Evaluation

Saving and Loading Models

# Load the fine-tuned model
import torch
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
)
policy.to("cuda")
policy.eval()

# Run inference (image_tensor and state_tensor are placeholders for your data)
observation = {
    "observation.images.cam_high": image_tensor,
    "observation.state": state_tensor,
}

with torch.no_grad():
    action = policy.select_action(observation)

Performance Evaluation Script

# evaluation.py
import torch
from torch.utils.data import DataLoader
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_model(model_path, repo_id, root=None):
    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Load the fine-tuned policy
    policy = SmolVLAPolicy.from_pretrained(model_path)
    policy.to(device)
    policy.eval()

    # Load the dataset to evaluate on (ideally episodes held out from training).
    # Note: computing the chunked action loss may require constructing the
    # dataset with the same delta_timestamps used during training.
    dataset = LeRobotDataset(repo_id, root=root)
    loader = DataLoader(dataset, batch_size=8, num_workers=2)

    total_loss = 0.0
    num_batches = 0

    with torch.no_grad():
        for batch in loader:
            batch = {k: v.to(device) if isinstance(v, torch.Tensor) else v
                     for k, v in batch.items()}
            # LeRobot policies return (loss, loss_dict) from forward()
            loss, _ = policy.forward(batch)
            total_loss += loss.item()
            num_batches += 1

    avg_loss = total_loss / num_batches
    print(f"Average test loss: {avg_loss:.4f}")
    return avg_loss

if __name__ == "__main__":
    model_path = "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
    evaluate_model(model_path, repo_id="io-ai-data/my_dataset", root="/data/lerobot_dataset")

Deployment and Inference

Real-time Inference Example

# inference.py
import torch
import numpy as np
from PIL import Image
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

class SmolVLAInference:
    def __init__(self, model_path, device="cuda"):
        self.device = device
        self.policy = SmolVLAPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()

    def predict_action(self, image, state, instruction=""):
        # Preprocess the image
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)

        # Build the observation batch (batch dimension of 1)
        observation = {
            "observation.images.cam_high": self.preprocess_image(image).to(self.device),
            "observation.state": torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device),
            "task": instruction,  # language-instruction key used in LeRobot batches
        }

        # Predict the next action
        with torch.no_grad():
            action = self.policy.select_action(observation)

        return action.squeeze(0).cpu().numpy()  # drop the batch dimension

    def preprocess_image(self, image):
        # Resize and convert to a (1, C, H, W) float tensor in [0, 1]
        image = image.resize((224, 224))
        image_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).float() / 255.0
        return image_tensor.unsqueeze(0)

# Usage example
if __name__ == "__main__":
    inference = SmolVLAInference("outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model")

    # Simulated inputs
    image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
    state = np.random.randn(7)  # e.g., a 7-DoF robot state
    instruction = "pick up the red cube"

    action = inference.predict_action(image, state, instruction)
    print(f"Predicted action: {action}")

Best Practices

Data Preparation Recommendations

  1. Data Quality: Ensure quality of demonstration data, avoid incorrect or inconsistent actions
  2. Data Diversity: Include data from different scenarios, lighting conditions, and object positions
  3. Task Descriptions: Provide clear natural language descriptions for each episode
  4. Data Balance: Ensure balance between success and failure cases

Training Optimization Recommendations

  1. Learning Rate Scheduling: Use learning rate warmup and decay strategies
  2. Regularization: Appropriately use dropout and weight decay
  3. Checkpoint Saving: Regularly save model checkpoints
  4. Early Stopping: Monitor validation loss to avoid overfitting

Hardware Optimization Recommendations

  1. VRAM Management: Use mixed precision training to save VRAM
  2. Batch Size: Adjust batch size based on VRAM capacity
  3. Data Loading: Use multi-process data loading to accelerate training
  4. Model Parallelism: For large models, consider using model parallelism

Frequently Asked Questions (FAQ)

Q: What advantages does SmolVLA have compared to other VLA models?

A: Main advantages of SmolVLA include:

  • Lightweight: Only 450M parameters, suitable for consumer-grade hardware
  • Efficient Training: Relatively short training time
  • Good Performance: Excellent performance on multiple robot tasks
  • Easy Deployment: Moderate model size, convenient for actual deployment

Q: How long does training take?

A: Training time depends on multiple factors:

  • Dataset size: 100 episodes take approximately 2-4 hours (RTX 3080)
  • Batch size: Larger batches can accelerate training
  • Hardware configuration: Better GPUs can significantly reduce training time
  • Training steps: 20000 steps are usually sufficient for good results

Q: How to determine if the model has converged?

A: Observe the following metrics:

  • Loss curves: Overall loss should steadily decrease and plateau
  • Validation performance: Performance on validation set no longer improves
  • Action predictions: Model-predicted actions should be reasonable
  • Actual testing: Test model performance in real environment

Q: What to do if VRAM is insufficient?

A: You can try the following methods:

  • Reduce batch size (e.g., from 64 to 32 or 16): --batch_size 16
  • Enable mixed precision training: --policy.use_amp true
  • Reduce data loading threads: --num_workers 2
  • Use smaller image resolution: --policy.resize_imgs_with_padding 224 224
  • Reduce observation steps: --policy.n_obs_steps 1

Q: How to improve model performance?

A: Methods to improve performance:

  • Increase data volume: Collect more high-quality demonstration data
  • Data augmentation: Use image augmentation techniques to increase data diversity
  • Hyperparameter tuning: Adjust learning rate, batch size and other parameters
  • Model ensembling: Train multiple models and ensemble them
  • Domain adaptation: Additional fine-tuning for specific tasks

Changelog

  • 2024-01: Initial version release
  • 2024-02: Added multi-GPU training support
  • 2024-03: Optimized memory usage and training efficiency
  • 2024-04: Added more evaluation and deployment examples