SmolVLA Model Fine-tuning
Overview
SmolVLA (Small Vision-Language-Action) is a lightweight vision-language-action model from Hugging Face, designed specifically for robot learning tasks. At roughly 450M parameters it runs on consumer-grade hardware, making it a practical choice for robot learning research and development.
Prerequisites
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
- Python Version: 3.10+ (LeRobot requires Python 3.10 or newer)
- GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 8GB VRAM
- Memory: At least 16GB RAM
- Storage: At least 50GB available space
Environment Setup
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot
# Install LeRobot with the SmolVLA extra dependencies
pip install -e ".[smolvla]"
2. Install Additional Dependencies
# Install Flash Attention (optional, for training acceleration)
pip install flash-attn --no-build-isolation
# Install Weights & Biases (for experiment tracking)
pip install wandb
wandb login
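After installation, a quick sanity check (a minimal sketch) confirms that lerobot imports and that PyTorch can see the GPU:
# check_env.py: verify the environment before training
import torch
import lerobot

print(f"lerobot version: {getattr(lerobot, '__version__', 'unknown')}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")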
Data Preparation
LeRobot Format Data
SmolVLA expects datasets in the LeRobot format. A typical LeRobot (v2) dataset looks like this (the exact metadata files vary slightly between LeRobot versions):
your_dataset/
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   ├── episode_000001.parquet
│   │   └── ...
│   └── chunk-001/
│       └── ...
├── meta/
│   ├── info.json
│   ├── episodes.jsonl
│   ├── tasks.jsonl
│   └── stats.json
└── videos/
    └── chunk-000/
        ├── observation.images.cam_high/
        │   ├── episode_000000.mp4
        │   └── ...
        └── observation.images.cam_low/
            └── ...
Data Quality Requirements
According to Hugging Face's recommendations, SmolVLA fine-tuning needs:
- Minimum 25 high-quality episodes to achieve good performance
- 100+ episodes recommended for optimal results
- Each episode should contain a complete task execution process
- Image resolution recommended at 224x224 or 256x256
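Before launching training, it is worth loading the dataset once and confirming that the episode count, camera keys, and fps look right. A minimal sketch; the repo id is a placeholder and the attribute names assume a recent LeRobot release:
# inspect_dataset.py: quick dataset sanity check (repo id is a placeholder)
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-hf-username/my_dataset")
print(f"Episodes: {dataset.num_episodes}")
print(f"Frames:   {dataset.num_frames}")
print(f"FPS:      {dataset.fps}")
print(f"Cameras:  {dataset.meta.camera_keys}")
print(f"Features: {list(dataset.features)}")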
Fine-tuning Training
Basic Training Command
# Set environment variables (replace with your Hugging Face username)
export HF_USER="your-hf-username"
export CUDA_VISIBLE_DEVICES=0
# Start SmolVLA fine-tuning
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/lerobot_dataset \
--batch_size 64 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_finetuning \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.scheduler_warmup_steps 1000 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 5000 \
--wandb.enable true \
--wandb.project smolvla_finetuning
Advanced Training Configuration
Multi-GPU Training
# Multi-GPU training using torchrun
torchrun --nproc_per_node=2 --master_port=29500 \
$(which lerobot-train) \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--dataset.root /data/my_dataset \
--batch_size 32 \
--steps 20000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_multi_gpu \
--policy.device cuda \
--policy.optimizer_lr 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Memory Optimization Configuration
# For GPUs with smaller VRAM
lerobot-train \
--policy.type smolvla \
--policy.pretrained_path lerobot/smolvla_base \
--dataset.repo_id ${HF_USER}/my_dataset \
--batch_size 16 \
--steps 30000 \
--output_dir outputs/train/smolvla_finetuned \
--job_name smolvla_memory_optimized \
--policy.device cuda \
--policy.optimizer_lr 5e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Parameter Details
Core Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.type | Policy type | smolvla | SmolVLA model type |
| --policy.pretrained_path | Pretrained model path | lerobot/smolvla_base | Official pretrained model on Hugging Face |
| --dataset.repo_id | Dataset repository ID | ${HF_USER}/my_dataset | Your Hugging Face dataset |
| --dataset.root | Dataset storage location | /data/my_dataset | Read the dataset from a local directory (optional) |
| --batch_size | Batch size | 64 | Adjust to available VRAM; 32-64 is reasonable on an RTX 3080 |
| --steps | Training steps | 20000 | Can be reduced to 10000 for small datasets |
| --output_dir | Output directory | outputs/train/smolvla_finetuned | Model save path |
| --job_name | Job name | smolvla_finetuning | For logging and experiment tracking (optional) |
Training Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-4 | Can be reduced further for fine-tuning |
| --policy.scheduler_warmup_steps | Warmup steps | 1000 | Learning rate warmup, stabilizes training |
| --policy.use_amp | Mixed precision | true | Saves VRAM, accelerates training |
| --policy.optimizer_grad_clip_norm | Gradient clipping | 1.0 | Prevents gradient explosion |
| --num_workers | Data loading workers | 4 | Adjust based on CPU core count |
| --policy.push_to_hub | Push to Hub | false | Whether to upload the model to the Hugging Face Hub (requires repo_id) |
| --save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
| --save_freq | Save frequency | 5000 | Checkpoint save interval in steps |
Model-Specific Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.vlm_model_name | VLM backbone model | HuggingFaceTB/SmolVLM2-500M-Video-Instruct | Vision-language model used by SmolVLA |
| --policy.chunk_size | Action chunk size | 50 | Length of predicted action sequence |
| --policy.n_action_steps | Execute action steps | 50 | Number of actions actually executed each time |
| --policy.n_obs_steps | Observation history steps | 1 | Number of historical observation frames used |
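To make the relationship between chunk_size and n_action_steps concrete, here is a conceptual control loop. It is a plain-Python sketch of the action-queue idea, not lerobot's internal implementation; predict_chunk is a stand-in for the policy call.
# Conceptual sketch: the policy predicts `chunk_size` future actions per call, and
# `n_action_steps` of them are executed before the policy is queried again.
from collections import deque

import numpy as np

CHUNK_SIZE = 50       # --policy.chunk_size
N_ACTION_STEPS = 50   # --policy.n_action_steps
ACTION_DIM = 7

def predict_chunk(observation):
    """Stand-in for a policy call returning a (CHUNK_SIZE, ACTION_DIM) action chunk."""
    return np.zeros((CHUNK_SIZE, ACTION_DIM))

queue = deque()
for step in range(200):                       # control loop over 200 timesteps
    if not queue:                             # re-plan only when the queue is empty
        chunk = predict_chunk(observation={})
        queue.extend(chunk[:N_ACTION_STEPS])  # keep only the actions that will be executed
    action = queue.popleft()                  # one action per control step
    # send `action` to the robot here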
Training Monitoring
Weights & Biases Integration
SmolVLA supports W&B for experiment tracking:
# Enable W&B logging
lerobot-train \
--policy.type smolvla \
--dataset.repo_id your-name/your-repo \
--batch_size 64 \
--steps 20000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project smolvla_experiments \
--wandb.notes "SmolVLA finetuning on custom dataset"
# ... plus the remaining parameters from the basic training command
Key Metrics Monitoring
Metrics to monitor during training:
- Loss: Overall training loss (SmolVLA's flow-matching action objective); it should steadily decrease and then plateau
- Gradient Norm: Watch for spikes, which indicate unstable updates
- Learning Rate: Learning rate changes
- GPU Memory: VRAM usage
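Beyond the live dashboard, finished runs can be pulled back for offline analysis through the W&B API. A minimal sketch; the run path and metric key names are placeholders that depend on how your project and logger are configured:
# fetch_metrics.py: pull logged metrics from a finished W&B run (placeholders throughout)
import wandb

api = wandb.Api()
run = api.run("your-entity/smolvla_finetuning/run_id")   # format: entity/project/run_id
history = run.history(keys=["loss", "lr"], pandas=True)  # key names depend on the logger setup
print(history.tail())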
Model Evaluation
Saving and Loading Models
# Load fine-tuned model
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
policy = SmolVLAPolicy.from_pretrained(
    "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
)
policy.to("cuda")
policy.eval()
# Perform inference (tensors need a leading batch dimension and must be on the policy's device)
observation = {
    "observation.images.cam_high": image_tensor,  # (1, 3, H, W) float image in [0, 1]
    "observation.state": state_tensor,            # (1, state_dim)
    "task": "pick up the red cube",               # language instruction (key name may vary across lerobot versions)
}
action = policy.select_action(observation)
Performance Evaluation Script
# evaluation.py
import torch
from torch.utils.data import DataLoader

from lerobot.datasets.lerobot_dataset import LeRobotDataset
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy

def evaluate_model(model_path, repo_id, episodes=None):
    # Load the fine-tuned policy
    policy = SmolVLAPolicy.from_pretrained(model_path)
    policy.to("cuda")
    policy.eval()
    # Load the evaluation data. LeRobotDataset has no built-in train/test split,
    # so pass a list of held-out episode indices via `episodes`.
    # delta_timestamps makes the dataset return action chunks matching the policy's chunk size.
    fps = 30  # set this to your dataset's fps
    delta_timestamps = {"action": [i / fps for i in range(policy.config.chunk_size)]}
    dataset = LeRobotDataset(repo_id, episodes=episodes, delta_timestamps=delta_timestamps)
    dataloader = DataLoader(dataset, batch_size=8, shuffle=False, num_workers=2)
    total_loss, num_batches = 0.0, 0
    with torch.no_grad():
        for batch in dataloader:
            batch = {k: (v.to("cuda") if isinstance(v, torch.Tensor) else v) for k, v in batch.items()}
            loss, _ = policy.forward(batch)  # the policy's forward pass returns (loss, loss_dict)
            total_loss += loss.item()
            num_batches += 1
    avg_loss = total_loss / num_batches
    print(f"Average evaluation loss: {avg_loss:.4f}")
    return avg_loss

if __name__ == "__main__":
    model_path = "outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model"
    evaluate_model(model_path, repo_id="your-hf-username/my_dataset", episodes=[0, 1, 2])
Deployment and Inference
Real-time Inference Example
# inference.py
import torch
import numpy as np
from PIL import Image
from lerobot.policies.smolvla.modeling_smolvla import SmolVLAPolicy
class SmolVLAInference:
    def __init__(self, model_path, device="cuda"):
        self.device = device
        self.policy = SmolVLAPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()

    def predict_action(self, image, state, instruction=""):
        # Preprocess image
        if isinstance(image, np.ndarray):
            image = Image.fromarray(image)
        # Build observation: batch dimension of 1, all tensors on the policy's device
        observation = {
            "observation.images.cam_high": self.preprocess_image(image),
            "observation.state": torch.tensor(state, dtype=torch.float32).unsqueeze(0).to(self.device),
            "task": instruction,  # language instruction (key name may vary across lerobot versions)
        }
        # Predict action
        with torch.no_grad():
            action = self.policy.select_action(observation)
        return action.squeeze(0).cpu().numpy()  # drop the batch dimension

    def preprocess_image(self, image):
        # Resize and convert to a (1, 3, H, W) float tensor in [0, 1]
        image = image.resize((224, 224))
        image_tensor = torch.tensor(np.array(image)).permute(2, 0, 1).float() / 255.0
        return image_tensor.unsqueeze(0).to(self.device)
# Usage example
if __name__ == "__main__":
inference = SmolVLAInference("outputs/train/smolvla_finetuned/checkpoints/last/pretrained_model")
# Simulate input
image = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
state = np.random.randn(7) # 7-DOF robot state
instruction = "pick up the red cube"
action = inference.predict_action(image, state, instruction)
print(f"Predicted action: {action}")
Best Practices
Data Preparation Recommendations
- Data Quality: Ensure quality of demonstration data, avoid incorrect or inconsistent actions
- Data Diversity: Include data from different scenarios, lighting conditions, and object positions
- Task Descriptions: Provide clear natural language descriptions for each episode
- Data Balance: Ensure balance between success and failure cases
Training Optimization Recommendations
- Learning Rate Scheduling: Use learning rate warmup and decay strategies
- Regularization: Appropriately use dropout and weight decay
- Checkpoint Saving: Regularly save model checkpoints
- Early Stopping: Monitor validation loss to avoid overfitting
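Early stopping is typically wired around your own periodic validation rather than a built-in training flag. A minimal, generic sketch of the pattern (the validation losses below are dummy values):
# Generic early-stopping pattern; plug in your own validation loop.
validation_losses = [0.90, 0.72, 0.65, 0.64, 0.66, 0.65, 0.67, 0.66]  # dummy values

best_loss = float("inf")
patience, bad_evals = 3, 0

for step, val_loss in enumerate(validation_losses):
    if val_loss < best_loss - 1e-4:          # improvement beyond a small tolerance
        best_loss, bad_evals = val_loss, 0   # reset the patience counter
        # keep/save this checkpoint
    else:
        bad_evals += 1
        if bad_evals >= patience:
            print(f"Early stop at evaluation {step}: no improvement for {patience} rounds")
            break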
Hardware Optimization Recommendations
- VRAM Management: Use mixed precision training to save VRAM
- Batch Size: Adjust batch size based on VRAM capacity
- Data Loading: Use multi-process data loading to accelerate training
- Model Parallelism: For large models, consider using model parallelism
Frequently Asked Questions (FAQ)
Q: What advantages does SmolVLA have compared to other VLA models?
A: Main advantages of SmolVLA include:
- Lightweight: Only 450M parameters, suitable for consumer-grade hardware
- Efficient Training: Relatively short training time
- Good Performance: Excellent performance on multiple robot tasks
- Easy Deployment: Moderate model size, convenient for actual deployment
Q: How long does training take?
A: Training time depends on multiple factors:
- Dataset size: 100 episodes take approximately 2-4 hours (RTX 3080)
- Batch size: Larger batches can accelerate training
- Hardware configuration: Better GPUs can significantly reduce training time
- Training steps: 20000 steps are usually sufficient for good results
Q: How to determine if the model has converged?
A: Observe the following metrics:
- Loss curves: Overall loss should steadily decrease and plateau
- Validation performance: Performance on validation set no longer improves
- Action predictions: Model-predicted actions should be reasonable
- Actual testing: Test model performance in real environment
Q: What to do if VRAM is insufficient?
A: You can try the following methods:
- Reduce batch size (e.g., from 64 to 32 or 16):
--batch_size 16
- Enable mixed precision training:
--policy.use_amp true
- Reduce data loading threads:
--num_workers 2
- Use smaller image resolution:
--policy.resize_imgs_with_padding 224 224
- Reduce observation steps:
--policy.n_obs_steps 1
Q: How to improve model performance?
A: Methods to improve performance:
- Increase data volume: Collect more high-quality demonstration data
- Data augmentation: Use image augmentation techniques to increase data diversity (see the sketch after this list)
- Hyperparameter tuning: Adjust learning rate, batch size and other parameters
- Model ensembling: Train multiple models and ensemble them
- Domain adaptation: Additional fine-tuning for specific tasks
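For the data-augmentation point, here is a generic torchvision sketch; it sits in your own preprocessing and is independent of whatever dataset-side transform options your LeRobot version exposes:
# Generic image augmentation with torchvision (applied to a dummy CHW tensor in [0, 1])
import torch
from torchvision.transforms import v2

augment = v2.Compose([
    v2.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    v2.RandomAffine(degrees=0, translate=(0.05, 0.05)),
])

image = torch.rand(3, 224, 224)   # dummy camera frame
augmented = augment(image)
print(augmented.shape)            # torch.Size([3, 224, 224])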
Related Resources
- [SmolVLA Official Blog](https://huggingface.co/blog/smolvla)
- [LeRobot Official Documentation](https://huggingface.co/docs/lerobot)
- [SmolVLA Model Page (lerobot/smolvla_base)](https://huggingface.co/lerobot/smolvla_base)
- [LeRobot GitHub Repository](https://github.com/huggingface/lerobot)
- Robot Learning Papers Collection
Changelog
- 2024-01: Initial version release
- 2024-02: Added multi-GPU training support
- 2024-03: Optimized memory usage and training efficiency
- 2024-04: Added more evaluation and deployment examples