ACT (Action Chunking Transformer) Model Fine-tuning
Overview
ACT (Action Chunking Transformer) is an end-to-end imitation learning model designed for dexterous manipulation tasks. By predicting chunks of future actions rather than a single action per step, it mitigates the compounding-error problem of traditional imitation learning and achieves high success rates in robot manipulation on low-cost hardware.
Core Features
- Action Chunking Prediction: Predicts multiple consecutive actions at once, reducing compounding errors
- Transformer Architecture: Utilizes attention mechanism to process sequential data
- End-to-End Training: Directly predicts actions from raw observations
- High Success Rate: Excels at dexterous manipulation tasks
- Hardware Friendly: Can run on consumer-grade hardware
Prerequisites
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
- Python Version: 3.10+ (the environment below is created with Python 3.10)
- GPU: NVIDIA GPU (RTX 3070 or higher recommended), at least 6GB VRAM
- Memory: At least 16GB RAM
- Storage: At least 30GB available space
Environment Setup
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot
# Install dependencies
pip install -e .
2. Install ACT-Specific Dependencies
# Install additional packages
pip install einops
pip install timm
pip install wandb
# Login to Weights & Biases (optional)
wandb login
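Before moving on, a quick sanity check (plain Python one-liners, nothing LeRobot-specific beyond the import) confirms that PyTorch can see the GPU and that lerobot installed correctly:
# Verify that PyTorch sees the GPU and that lerobot imports cleanly
python -c "import torch; print(torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import lerobot; print('lerobot imported OK')"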
ACT Model Architecture
Core Components
- Vision Encoder: Processes multi-view image inputs
- State Encoder: Processes robot state information
- Transformer Decoder: Generates action sequences
- Action Head: Outputs final action predictions
Key Parameters
- Chunk Size: Number of actions predicted at once (typically 50-100)
- Context Length: Length of historical observations
- Hidden Dimension: Transformer hidden dimension
- Number of Heads: Number of attention heads
- Number of Layers: Number of Transformer encoder/decoder layers (these map onto the config sketch below)
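The parameters above correspond to fields in LeRobot's ACT configuration. As a rough sketch (import path and field names are assumed to match the installed LeRobot version and the --policy.* flags used later in this guide), the same configuration can be built in Python:
# Sketch: ACT configuration in Python; lerobot-train builds the equivalent object from --policy.* flags.
# Import path and field names are assumptions; verify them against your LeRobot version.
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    chunk_size=100,          # actions predicted per forward pass
    n_action_steps=100,      # predicted actions actually executed
    n_obs_steps=1,           # observation history length
    vision_backbone="resnet18",
    dim_model=512,           # Transformer hidden dimension
    dim_feedforward=3200,    # feedforward layer dimension
    n_heads=8,               # attention heads
    n_encoder_layers=4,
    n_decoder_layers=1,
    use_vae=True,            # ACT's variational (CVAE) objective
)
print(config)
In practice the config also needs the dataset's input/output features before a policy can be instantiated; lerobot-train fills these in automatically from the dataset.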
Data Preparation
LeRobot Format Data
ACT requires datasets in the LeRobot format, organized along these lines:
your_dataset/
├── data/
│ ├── chunk-001/
│ │ ├── observation.images.cam_high.png
│ │ ├── observation.images.cam_low.png
│ │ ├── observation.images.cam_left_wrist.png
│ │ ├── observation.images.cam_right_wrist.png
│ │ ├── observation.state.npy
│ │ ├── action.npy
│ │ └── ...
│ └── chunk-002/
│ └── ...
├── meta.json
├── stats.safetensors
└── videos/
├── episode_000000.mp4
└── ...
Data Quality Requirements
- Minimum 50 episodes for basic training
- 200+ episodes recommended for optimal results
- Each episode should contain complete task execution
- Multi-view images (at least 2 cameras)
- High-quality, consistent action recordings (see the dataset check sketch after this list)
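Before launching training, it is worth inspecting the dataset against these requirements. The snippet below is a sketch only; the attribute names (num_episodes, num_frames, fps, meta.camera_keys) are assumptions that should be verified against your LeRobot version:
# Sketch: quick dataset sanity check before training.
# Attribute names are assumed; verify against your LeRobot version.
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-name/your_dataset")  # or a local dataset path
print("episodes:", dataset.num_episodes)    # aim for 50+, ideally 200+
print("frames:", dataset.num_frames)
print("fps:", dataset.fps)
print("cameras:", dataset.meta.camera_keys)  # at least 2 camera views recommended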
Fine-tuning Training
Note: The ACT model's n_action_steps must be ≤ chunk_size. It is recommended to set both to the same value (e.g., both set to 100), as illustrated in the sketch below.
- chunk_size: length of the action sequence the model predicts in one forward pass
- n_action_steps: number of predicted actions actually executed before the model is queried again
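To make the relationship concrete, the loop below is a minimal, library-free sketch of how chunked prediction and execution interleave; predict_chunk and send_to_robot are hypothetical stand-ins for the policy call and the robot interface:
# Minimal sketch of chunked execution; predict_chunk/send_to_robot are hypothetical stand-ins.
chunk_size = 100       # actions predicted per forward pass
n_action_steps = 100   # actions executed before the next prediction (must be <= chunk_size)
action_dim = 7

def predict_chunk(observation):
    # A real policy would return a [chunk_size, action_dim] array from the current observation.
    return [[0.0] * action_dim for _ in range(chunk_size)]

def send_to_robot(action):
    pass  # placeholder for the robot interface

observation = None  # would come from cameras and joint sensors
for _ in range(10):  # outer control loop
    chunk = predict_chunk(observation)
    for action in chunk[:n_action_steps]:
        send_to_robot(action)
        # observation = read_sensors()  # refresh the observation between predictions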
Basic Training Command
# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0
# Start ACT training
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 8 \
--steps 50000 \
--output_dir outputs/train/act_finetuned \
--job_name act_finetuning \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 10000 \
--wandb.enable true
Advanced Training Configuration
Multi-Camera Configuration
# ACT training with multi-camera setup
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 4 \
--steps 100000 \
--output_dir outputs/train/act_multicam \
--job_name act_multicam_training \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 2 \
--policy.vision_backbone resnet18 \
--policy.dim_model 512 \
--policy.dim_feedforward 3200 \
--policy.n_encoder_layers 4 \
--policy.n_decoder_layers 1 \
--policy.n_heads 8 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Memory Optimization Configuration
# For GPUs with smaller VRAM
lerobot-train \
--policy.type act \
--dataset.repo_id io-ai-data/lerobot_data \
--batch_size 2 \
--steps 75000 \
--output_dir outputs/train/act_memory_opt \
--job_name act_memory_optimized \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.vision_backbone resnet18 \
--policy.dim_model 256 \
--policy.optimizer_lr 1e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Parameter Details
Core Parameters
Parameter | Meaning | Recommended Value | Description |
---|---|---|---|
--policy.type | Policy type | act | ACT model type |
--policy.pretrained_path | Pretrained model path | lerobot/act | LeRobot official ACT model (optional) |
--dataset.repo_id | Dataset repository ID | ${HF_USER}/dataset | Your HuggingFace dataset |
--batch_size | Batch size | 8 | Adjust based on VRAM, RTX 3070 recommended 4-8 |
--steps | Training steps | 50000 | Dexterous tasks recommended 50000-100000 steps |
--output_dir | Output directory | outputs/train/act_finetuned | Model save path |
--job_name | Job name | act_finetuning | For logging and experiment tracking (optional) |
ACT-Specific Parameters
Parameter | Meaning | Recommended Value | Description |
---|---|---|---|
--policy.chunk_size | Action chunk size | 100 | Number of actions predicted each time |
--policy.n_action_steps | Execute action steps | 100 | Number of actions actually executed |
--policy.n_obs_steps | Observation steps | 1 | Number of historical observations |
--policy.vision_backbone | Vision backbone | resnet18 | Image feature extraction network |
--policy.dim_model | Model dimension | 512 | Transformer main dimension |
--policy.dim_feedforward | Feedforward dimension | 3200 | Transformer feedforward layer dimension |
--policy.n_encoder_layers | Encoder layers | 4 | Number of Transformer encoder layers |
--policy.n_decoder_layers | Decoder layers | 1 | Number of Transformer decoder layers |
--policy.n_heads | Attention heads | 8 | Number of multi-head attention heads |
--policy.use_vae | Use VAE | true | Train with ACT's variational (CVAE) objective |
Training Parameters
Parameter | Meaning | Recommended Value | Description |
---|---|---|---|
--policy.optimizer_lr | Learning rate | 1e-5 | ACT recommends smaller learning rates |
--policy.optimizer_weight_decay | Weight decay | 1e-4 | Regularization parameter (matches the training commands above) |
--policy.optimizer_lr_backbone | Backbone learning rate | 1e-5 | Vision encoder learning rate |
--policy.use_amp | Mixed precision | true | Saves VRAM |
--num_workers | Data loading threads | 4 | Adjust based on CPU core count |
--policy.push_to_hub | Push to Hub | false | Whether to upload model to HuggingFace (requires repo_id) |
--save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
--save_freq | Save frequency | 10000 | Checkpoint save interval |
Training Monitoring and Debugging
Weights & Biases Integration
# Detailed W&B configuration
lerobot-train \
--policy.type act \
--dataset.repo_id your-name/your-dataset \
--batch_size 8 \
--steps 50000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project act_experiments \
--wandb.notes "ACT training with 4 cameras" \
# ... other parameters
Key Metrics Monitoring
Metrics to monitor during training:
- Total Loss: Overall loss, should steadily decrease
- Action Loss: Action prediction loss (L1/L2 loss)
- Learning Rate: Learning rate change curve
- Gradient Norm: Gradient norm, monitor gradient explosion
- GPU Memory: VRAM usage
- Training Speed: Samples processed per second
Training Log Analysis
# log_analysis.py
import wandb
import matplotlib.pyplot as plt

def analyze_training_logs(project_name, run_name):
    api = wandb.Api()
    run = api.run(f"{project_name}/{run_name}")
    # Get training metrics
    history = run.history()
    # Plot loss curves
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 3, 1)
    plt.plot(history['step'], history['train/total_loss'])
    plt.title('Total Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')
    plt.subplot(1, 3, 2)
    plt.plot(history['step'], history['train/action_loss'])
    plt.title('Action Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')
    plt.subplot(1, 3, 3)
    plt.plot(history['step'], history['train/learning_rate'])
    plt.title('Learning Rate')
    plt.xlabel('Step')
    plt.ylabel('LR')
    plt.tight_layout()
    plt.savefig('training_analysis.png')
    plt.show()

if __name__ == "__main__":
    analyze_training_logs("act_experiments", "act_v1_multicam")
Model Evaluation
Offline Evaluation
# offline_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_act_model(model_path, dataset_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cuda")
    policy.eval()
    # Load test dataset
    dataset = LeRobotDataset(dataset_path, split="test")
    total_l1_loss = 0
    total_l2_loss = 0
    num_samples = 0
    with torch.no_grad():
        for batch in dataset:
            # Model prediction
            prediction = policy(batch)
            # Calculate loss
            target_actions = batch['action']
            predicted_actions = prediction['action']
            l1_loss = torch.mean(torch.abs(predicted_actions - target_actions))
            l2_loss = torch.mean((predicted_actions - target_actions) ** 2)
            total_l1_loss += l1_loss.item()
            total_l2_loss += l2_loss.item()
            num_samples += 1
    avg_l1_loss = total_l1_loss / num_samples
    avg_l2_loss = total_l2_loss / num_samples
    print(f"Average L1 Loss: {avg_l1_loss:.4f}")
    print(f"Average L2 Loss: {avg_l2_loss:.4f}")
    return avg_l1_loss, avg_l2_loss

if __name__ == "__main__":
    model_path = "outputs/train/act_finetuned/checkpoints/last"
    dataset_path = "path/to/your/test/dataset"
    evaluate_act_model(model_path, dataset_path)
Online Evaluation (Robot Environment)
# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy

class ACTRobotController:
    def __init__(self, model_path, camera_names):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()
        self.camera_names = camera_names
        self.action_queue = []

    def get_action(self, observations):
        # If action queue is empty, predict new action chunk
        if len(self.action_queue) == 0:
            with torch.no_grad():
                # Build input
                batch = self.prepare_observation(observations)
                # Predict action chunk
                prediction = self.policy(batch)
                actions = prediction['action'].cpu().numpy()[0]  # [chunk_size, action_dim]
                # Add actions to queue
                self.action_queue = list(actions)
        # Return next action from queue
        return self.action_queue.pop(0)

    def prepare_observation(self, observations):
        batch = {}
        # Process image observations
        for cam_name in self.camera_names:
            image_key = f"observation.images.{cam_name}"
            if image_key in observations:
                image = observations[image_key]
                # Preprocess image (normalize, resize, etc.)
                image_tensor = self.preprocess_image(image)
                batch[image_key] = image_tensor.unsqueeze(0)
        # Process state observations
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0)
        return batch

    def preprocess_image(self, image):
        # Image preprocessing logic
        # This needs to match the preprocessing used during training
        image_tensor = torch.tensor(image).permute(2, 0, 1).float() / 255.0
        return image_tensor

# Usage example
if __name__ == "__main__":
    controller = ACTRobotController(
        model_path="outputs/train/act_finetuned/checkpoints/last",
        camera_names=["cam_high", "cam_low", "cam_left_wrist", "cam_right_wrist"]
    )
    # Simulate robot control loop
    for step in range(100):
        # Get current observation (this needs to be obtained from the actual robot)
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (480, 640, 3)),
            "observation.images.cam_low": np.random.randint(0, 255, (480, 640, 3)),
            "observation.state": np.random.randn(7)
        }
        # Get action
        action = controller.get_action(observations)
        # Execute action (send to robot)
        print(f"Step {step}: Action = {action}")
        # This should send the action to the actual robot
        # robot.execute_action(action)
Deployment and Optimization
Model Quantization
# quantization.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

def quantize_act_model(model_path, output_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cpu")
    policy.eval()
    # Dynamic quantization
    quantized_policy = torch.quantization.quantize_dynamic(
        policy,
        {torch.nn.Linear},
        dtype=torch.qint8
    )
    # Save quantized model
    torch.save(quantized_policy.state_dict(), output_path)
    print(f"Quantized model saved to {output_path}")
    return quantized_policy

if __name__ == "__main__":
    quantize_act_model(
        "outputs/train/act_finetuned/checkpoints/last",
        "outputs/act_quantized.pth"
    )
Inference Optimization
# optimized_inference.py
import torch
import torch.jit
from lerobot.policies.act.modeling_act import ACTPolicy

class OptimizedACTInference:
    def __init__(self, model_path, use_jit=True):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()
        if use_jit:
            # Optimize using TorchScript
            self.policy = torch.jit.script(self.policy)
        # Warmup model
        self.warmup()

    def warmup(self):
        # Warmup model with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.state": torch.randn(1, 7, device="cuda")
        }
        with torch.no_grad():
            for _ in range(10):
                _ = self.policy(dummy_batch)

    @torch.no_grad()
    def predict(self, observations):
        # Fast inference
        prediction = self.policy(observations)
        return prediction['action'].cpu().numpy()

if __name__ == "__main__":
    inference = OptimizedACTInference(
        "outputs/train/act_finetuned/checkpoints/last"
    )
    # Test inference speed
    import time
    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }
    start_time = time.time()
    for _ in range(100):
        action = inference.predict(observations)
    end_time = time.time()
    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1/avg_inference_time:.2f} Hz")
Best Practices
Data Collection Recommendations
- Multi-view Data: Use multiple cameras to obtain rich visual information
- High-quality Demonstrations: Ensure consistency and accuracy of demonstration data
- Task Diversity: Include different starting states and goal configurations
- Failure Cases: Appropriately include failure cases to improve robustness
Training Optimization Recommendations
- Action Chunk Size: Adjust chunk_size based on task complexity and control frequency
- Learning Rate Scheduling: Use cosine annealing or step decay
- Regularization: Use weight decay and dropout in moderation
- Data Augmentation: Apply mild augmentation to images (see the sketch after this list)
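If you experiment outside lerobot-train (for example in a custom fine-tuning loop), cosine annealing and light image augmentation can be set up with standard PyTorch/torchvision utilities. The sketch below is purely illustrative and is not part of the LeRobot CLI; the stand-in model and dummy batch are assumptions:
# Sketch: cosine-annealed learning rate and mild image augmentation for a custom loop.
# Illustrative only; lerobot-train manages its own optimizer/scheduler via --policy.* flags.
import torch
import torch.nn as nn
from torchvision import transforms

model = nn.Linear(512, 14)  # stand-in for the real policy network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)

# Keep augmentation mild so recorded actions still match what the cameras saw.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

for step in range(3):  # a few iterations for illustration
    images = torch.rand(8, 3, 480, 640)      # dummy batch of camera frames in [0, 1]
    images = augment(images)                 # image augmentation (unused by the stand-in model)
    loss = model(torch.rand(8, 512)).mean()  # placeholder loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()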
Deployment Optimization Recommendations
- Model Compression: Use quantization and pruning techniques to reduce model size
- Inference Acceleration: Use TensorRT or ONNX for inference optimization
- Memory Management: Properly manage action queues and observation caches
- Real-time Guarantee: Ensure the inference frequency meets the robot's control-rate requirement (see the loop sketch after this list)
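For the real-time point above, a fixed-rate control loop is the usual pattern. The sketch below is plain Python; get_observation and send_action are hypothetical stand-ins for the robot interface, and the controller call is left as a placeholder:
# Sketch: fixed-rate control loop (~30 Hz) that reports overruns.
# get_observation/send_action are hypothetical stand-ins for the robot interface.
import time

CONTROL_HZ = 30
PERIOD = 1.0 / CONTROL_HZ

def get_observation():
    return {}   # would read cameras and joint states from the robot

def send_action(action):
    pass        # would forward the action to the robot controller

next_tick = time.monotonic()
for step in range(300):
    obs = get_observation()
    action = None  # controller.get_action(obs) in a real deployment
    send_action(action)
    next_tick += PERIOD
    sleep_time = next_tick - time.monotonic()
    if sleep_time > 0:
        time.sleep(sleep_time)
    else:
        print(f"step {step}: overran the control period by {-sleep_time * 1000:.1f} ms")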
Frequently Asked Questions (FAQ)
Q: What advantages does ACT have compared to other imitation learning methods?
A: Main advantages of ACT include:
- Reduced Compounding Error: Reduces error accumulation by predicting action chunks
- Improved Success Rate: Excels at dexterous manipulation tasks
- End-to-End Training: No need for hand-crafted features
- Multimodal Fusion: Effectively fuses visual and state information
Q: How to choose the appropriate chunk_size?
A: The choice of chunk_size depends on task characteristics:
- Fast tasks: chunk_size = 10-30
- Medium tasks: chunk_size = 50-100
- Slow tasks: chunk_size = 100-200
- A chunk size of 50 is generally a good starting point
Q: How long does training take?
A: Training time depends on multiple factors:
- Dataset size: 100 episodes take approximately 4-8 hours (RTX 3070)
- Model complexity: Larger models require more time
- Hardware configuration: Better GPUs can significantly reduce training time
- Convergence requirement: Typically requires 50000-100000 steps
Q: How to handle multi-camera data?
A: Multi-camera processing recommendations:
- Camera selection: Choose complementary viewpoints
- Feature fusion: Fuse at the feature level
- Attention mechanism: Let the model learn to focus on important viewpoints
- Computing resources: Note that multi-camera increases computational burden
Q: How to improve model generalization?
A: Methods to improve generalization:
- Data diversity: Collect data under different conditions
- Data augmentation: Use image and action augmentation techniques
- Regularization: Appropriate weight decay and dropout
- Domain randomization: Use domain randomization techniques in simulation
- Multi-task learning: Train jointly on multiple related tasks
Related Resources
- ACT Original Paper
- LeRobot ACT Implementation
- ACT Official Code
- Robot Learning Tutorial
- LeRobot Documentation
Changelog
- 2024-01: Initial version release
- 2024-02: Added multi-camera support
- 2024-03: Optimized training efficiency and inference speed
- 2024-04: Added model compression and deployment optimization