ACT (Action Chunking Transformer) Model Fine-tuning
Overview
ACT (Action Chunking Transformer) is an end-to-end imitation learning model designed for fine-grained manipulation tasks. By predicting chunks of actions instead of single steps, it mitigates the compounding errors common in traditional imitation learning and achieves high success rates in robot manipulation, even on low-cost hardware.
Key Features
- Action Chunking Prediction: Predicts multiple consecutive actions at once to reduce compounding errors.
- Transformer Architecture: Utilizes attention mechanisms to process sequential data.
- End-to-End Training: Predicts actions directly from raw observations.
- High Success Rate: Performs exceptionally well on fine manipulation tasks.
- Hardware Friendly: Capable of running on consumer-grade hardware.
Prerequisites
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended) or macOS.
- Python Version: 3.10+ (required by recent LeRobot releases).
- GPU: NVIDIA GPU (RTX 3070 or higher recommended) with at least 6GB VRAM.
- Memory: At least 16GB RAM.
- Storage: At least 30GB available space.
Environment Preparation
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment (venv recommended; conda is also fine)
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip
# Install dependencies
pip install -e .
2. Install ACT-specific Dependencies
# Install additional packages
pip install einops
pip install timm
pip install wandb
# Login to Weights & Biases (optional)
wandb login
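Before launching a long training run, a quick sanity check of the environment can save time. The snippet below is a minimal sketch that only verifies the packages import and that CUDA is visible; the version attribute is looked up defensively in case a release does not expose it.
# check_install.py
import torch
import lerobot

print("LeRobot version:", getattr(lerobot, "__version__", "unknown"))
print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())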
ACT Model Architecture
Core Components
- Vision Encoder: Processes multi-view image inputs.
- State Encoder: Processes robot state information.
- Transformer Decoder: Generates action sequences.
- Action Head: Outputs final action predictions.
Key Parameters
- Chunk Size: Number of actions predicted at once (typically 50-100).
- Context Length: Length of historical observations.
- Hidden Dimension: Hidden dimension of the Transformer.
- Number of Heads: Number of attention heads.
- Number of Layers: Number of Transformer layers.
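For reference, these parameters correspond directly to fields on the ACT policy configuration in LeRobot. The sketch below builds a configuration with typical values; the import path and field names assume a recent LeRobot release and may differ slightly across versions.
# act_config_example.py
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    chunk_size=100,              # actions predicted per forward pass
    n_action_steps=100,          # predicted actions actually executed (must be <= chunk_size)
    n_obs_steps=1,               # historical observations fed to the model
    dim_model=512,               # hidden dimension of the Transformer
    n_heads=8,                   # attention heads
    n_encoder_layers=4,          # Transformer encoder layers
    n_decoder_layers=1,          # Transformer decoder layers
    vision_backbone="resnet18",  # vision encoder for image inputs
)
print(config)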
Data Preparation
LeRobot Format Data
ACT requires datasets in LeRobot format with the following structure:
your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   ├── observation.images.cam_left_wrist.png
│   │   ├── observation.images.cam_right_wrist.png
│   │   ├── observation.state.npy
│   │   ├── action.npy
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...
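Before training, it is worth loading the dataset once and checking that episode counts, camera keys, and tensor shapes look as expected. The sketch below is a minimal check; the repo ID is a placeholder, and the LeRobotDataset attributes assume a recent LeRobot release.
# inspect_dataset.py
from lerobot.datasets.lerobot_dataset import LeRobotDataset

# Placeholder repo ID; use your own dataset
dataset = LeRobotDataset("your-name/your_dataset")

print("episodes:", dataset.num_episodes)
print("frames:", dataset.num_frames)
print("fps:", dataset.fps)

# One sample should contain camera images, the robot state, and the action
sample = dataset[0]
for key, value in sample.items():
    print(key, getattr(value, "shape", type(value)))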
Data Quality Requirements
- Minimum 50 episodes for basic training.
- 200+ episodes recommended for optimal results.
- Each episode should contain a complete task execution.
- Multi-view images (at least 2 cameras).
- High-quality action annotations.
Fine-tuning Training
Important Parameter Constraint
The ACT model's n_action_steps must be ≤ chunk_size. It is recommended to set both to the same value (e.g., 100).
- chunk_size: length of the action sequence the model predicts at once.
- n_action_steps: number of those predicted actions actually executed before the next prediction.
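A recent LeRobot release typically rejects a configuration that violates this constraint at construction time. A minimal illustration, assuming the ACTConfig validation behaves this way:
# chunk_constraint_example.py
from lerobot.policies.act.configuration_act import ACTConfig

ok = ACTConfig(chunk_size=100, n_action_steps=100)       # valid: n_action_steps <= chunk_size

try:
    bad = ACTConfig(chunk_size=50, n_action_steps=100)   # invalid: would execute more steps than predicted
except ValueError as err:
    print("Rejected configuration:", err)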
Basic Training Command
# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0
# Start ACT training
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 8 \
--steps 50000 \
--output_dir outputs/train/act_finetuned \
--job_name act_finetuning \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 10000 \
--wandb.enable true
Advanced Training Configurations
Multi-camera Configuration
# ACT training for multi-camera setups
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 4 \
--steps 100000 \
--output_dir outputs/train/act_multicam \
--job_name act_multicam_training \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 2 \
--policy.vision_backbone resnet18 \
--policy.dim_model 512 \
--policy.dim_feedforward 3200 \
--policy.n_encoder_layers 4 \
--policy.n_decoder_layers 1 \
--policy.n_heads 8 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Memory-optimized Configuration
# For GPUs with smaller VRAM
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 2 \
--steps 75000 \
--output_dir outputs/train/act_memory_opt \
--job_name act_memory_optimized \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.vision_backbone resnet18 \
--policy.dim_model 256 \
--policy.optimizer_lr 1e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true
Parameter Details
Core Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.type | Policy type | act | Selects the ACT policy |
| --policy.pretrained_path | Pre-trained model path | lerobot/act | Official LeRobot ACT model (optional) |
| --dataset.repo_id | Dataset repo ID | ${HF_USER}/your_dataset | Your Hugging Face dataset |
| --batch_size | Batch size | 8 | Adjust based on VRAM; 4-8 recommended on an RTX 3070 |
| --steps | Training steps | 50000 | 50k-100k steps recommended for fine manipulation tasks |
| --output_dir | Output directory | outputs/train/act_finetuned | Path to save the model |
| --job_name | Job name | act_finetuning | Used for logging and experiment tracking (optional) |
ACT-specific Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.chunk_size | Action chunk size | 100 | Number of actions predicted at each inference |
| --policy.n_action_steps | Action steps to execute | 100 | Number of predicted actions actually executed |
| --policy.n_obs_steps | Observation steps | 1 | Number of historical observations |
| --policy.vision_backbone | Vision backbone | resnet18 | Network for image feature extraction |
| --policy.dim_model | Model dimension | 512 | Main Transformer dimension |
| --policy.dim_feedforward | Feedforward dimension | 3200 | Transformer feedforward layer dimension |
| --policy.n_encoder_layers | Encoder layers | 4 | Number of Transformer encoder layers |
| --policy.n_decoder_layers | Decoder layers | 1 | Number of Transformer decoder layers |
| --policy.n_heads | Attention heads | 8 | Number of multi-head attention heads |
| --policy.use_vae | Use VAE | true | Train with the variational (CVAE) objective |
Training Parameters
| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-5 | ACT works best with small learning rates |
| --policy.optimizer_weight_decay | Weight decay | 1e-4 | Regularization strength (value used in the commands above) |
| --policy.optimizer_lr_backbone | Backbone learning rate | 1e-5 | Vision encoder learning rate |
| --policy.use_amp | Mixed precision | true | Saves VRAM |
| --num_workers | Data loading workers | 4 | Adjust based on CPU cores |
| --policy.push_to_hub | Push to Hub | false | Upload model to Hugging Face (requires repo_id) |
| --save_checkpoint | Save checkpoint | true | Save training checkpoints |
| --save_freq | Save frequency | 10000 | Checkpoint saving interval (in steps) |
Training Monitoring and Debugging
Weights & Biases Integration
# Detailed W&B configuration
lerobot-train \
--policy.type act \
--dataset.repo_id your-name/your-dataset \
--batch_size 8 \
--steps 50000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project act_experiments \
--wandb.notes "ACT training with 4 cameras"
# ... other parameters as in the basic training command
Key Metrics to Monitor
Metrics to focus on during training:
- Total Loss: Overall loss, should decrease steadily.
- Action Loss: Action prediction loss (L1/L2 loss).
- Learning Rate: Learning rate curve.
- Gradient Norm: Gradient norm to monitor gradient explosion.
- GPU Memory: VRAM usage.
- Training Speed: Number of samples processed per second.
Training Log Analysis
# log_analysis.py
import wandb
import matplotlib.pyplot as plt

def analyze_training_logs(project_name, run_name):
    api = wandb.Api()
    run = api.run(f"{project_name}/{run_name}")

    # Get training metrics (the keys below must match what your run logged to W&B;
    # check the run page and adjust them if they differ)
    history = run.history()

    # Plot loss curves
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 3, 1)
    plt.plot(history['step'], history['train/total_loss'])
    plt.title('Total Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 2)
    plt.plot(history['step'], history['train/action_loss'])
    plt.title('Action Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 3)
    plt.plot(history['step'], history['train/learning_rate'])
    plt.title('Learning Rate')
    plt.xlabel('Step')
    plt.ylabel('LR')

    plt.tight_layout()
    plt.savefig('training_analysis.png')
    plt.show()

if __name__ == "__main__":
    analyze_training_logs("act_experiments", "act_v1_multicam")
Model Evaluation
Offline Evaluation
# offline_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_act_model(model_path, dataset_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cuda")
    policy.eval()

    # Load test dataset
    dataset = LeRobotDataset(dataset_path, split="test")

    total_l1_loss = 0
    total_l2_loss = 0
    num_samples = 0

    with torch.no_grad():
        for batch in dataset:
            # Model prediction
            prediction = policy(batch)

            # Calculate loss
            target_actions = batch['action']
            predicted_actions = prediction['action']

            l1_loss = torch.mean(torch.abs(predicted_actions - target_actions))
            l2_loss = torch.mean((predicted_actions - target_actions) ** 2)

            total_l1_loss += l1_loss.item()
            total_l2_loss += l2_loss.item()
            num_samples += 1

    avg_l1_loss = total_l1_loss / num_samples
    avg_l2_loss = total_l2_loss / num_samples

    print(f"Average L1 Loss: {avg_l1_loss:.4f}")
    print(f"Average L2 Loss: {avg_l2_loss:.4f}")

    return avg_l1_loss, avg_l2_loss

if __name__ == "__main__":
    model_path = "outputs/train/act_finetuned/checkpoints/last"
    dataset_path = "path/to/your/test/dataset"
    evaluate_act_model(model_path, dataset_path)
Online Evaluation (Robot Environment)
# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy

class ACTRobotController:
    def __init__(self, model_path, camera_names):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()
        self.camera_names = camera_names
        self.action_queue = []

    def get_action(self, observations):
        # If the action queue is empty, predict a new action chunk
        if len(self.action_queue) == 0:
            with torch.no_grad():
                # Build input
                batch = self.prepare_observation(observations)
                # Predict action chunk
                prediction = self.policy(batch)
                actions = prediction['action'].cpu().numpy()[0]  # [chunk_size, action_dim]
                # Add actions to queue
                self.action_queue = list(actions)

        # Return next action from queue
        return self.action_queue.pop(0)

    def prepare_observation(self, observations):
        batch = {}

        # Process image observations
        for cam_name in self.camera_names:
            image_key = f"observation.images.{cam_name}"
            if image_key in observations:
                image = observations[image_key]
                # Preprocess image (normalize, resize, etc.)
                image_tensor = self.preprocess_image(image)
                # Add batch dimension and move to the same device as the policy
                batch[image_key] = image_tensor.unsqueeze(0).to("cuda")

        # Process state observations
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0).to("cuda")

        return batch

    def preprocess_image(self, image):
        # Image preprocessing logic (must match training preprocessing)
        image_tensor = torch.tensor(image).permute(2, 0, 1).float() / 255.0
        return image_tensor

# Example usage
if __name__ == "__main__":
    controller = ACTRobotController(
        model_path="outputs/train/act_finetuned/checkpoints/last",
        camera_names=["cam_high", "cam_low", "cam_left_wrist", "cam_right_wrist"]
    )

    # Simulated robot control loop
    for step in range(100):
        # Get current observation (should come from the real robot)
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.images.cam_low": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.state": np.random.randn(7)
        }

        # Get action
        action = controller.get_action(observations)

        # Execute action (send to robot)
        print(f"Step {step}: Action = {action}")
        # Should send the action to the real robot here
        # robot.execute_action(action)
Deployment and Optimization
Model Quantization
# quantization.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

def quantize_act_model(model_path, output_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cpu")
    policy.eval()

    # Dynamic quantization
    quantized_policy = torch.quantization.quantize_dynamic(
        policy,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    # Save quantized model
    torch.save(quantized_policy.state_dict(), output_path)
    print(f"Quantized model saved to {output_path}")

    return quantized_policy

if __name__ == "__main__":
    quantize_act_model(
        "outputs/train/act_finetuned/checkpoints/last",
        "outputs/act_quantized.pth"
    )
Inference Optimization
# optimized_inference.py
import time
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

class OptimizedACTInference:
    def __init__(self, model_path, use_jit=True):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()

        if use_jit:
            # Use TorchScript optimization
            # (scripting may not support every policy; set use_jit=False to stay in eager mode)
            self.policy = torch.jit.script(self.policy)

        # Warm up the model
        self.warmup()

    def warmup(self):
        # Warm up with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.state": torch.randn(1, 7, device="cuda")
        }
        with torch.no_grad():
            for _ in range(10):
                _ = self.policy(dummy_batch)

    @torch.no_grad()
    def predict(self, observations):
        # Fast inference
        prediction = self.policy(observations)
        return prediction['action'].cpu().numpy()

if __name__ == "__main__":
    inference = OptimizedACTInference(
        "outputs/train/act_finetuned/checkpoints/last"
    )

    # Test inference speed
    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }

    start_time = time.time()
    for _ in range(100):
        action = inference.predict(observations)
    end_time = time.time()

    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1 / avg_inference_time:.2f} Hz")
Best Practices
Data Collection Suggestions
- Multi-view Data: Use multiple cameras to capture rich visual information.
- High-quality Demonstrations: Ensure consistency and accuracy in demonstration data.
- Task Diversity: Include different starting states and target configurations.
- Failure Cases: Appropriately include failure cases to improve robustness.
Training Optimization Suggestions
- Chunk Size: Adjust chunk_size based on task complexity.
- Learning Rate Scheduler: Use cosine annealing or step decay.
- Regularization: Use weight decay and dropout appropriately.
- Data Augmentation: Apply proper augmentation to images (see the sketch below).
Deployment Optimization Suggestions
- Model Compression: Use quantization and pruning techniques to reduce model size.
- Inference Acceleration: Use TensorRT or ONNX for inference optimization.
- Memory Management: Manage action queues and observation buffers efficiently.
- Real-time Guarantee: Ensure inference frequency meets control requirements.
FAQ
Q: What are the advantages of ACT compared to other imitation learning methods?
A: Key advantages include:
- Reduced Compound Errors: Predicting action chunks reduces error accumulation.
- Improved Success Rates: Performs excellently on fine-grained manipulation tasks.
- End-to-End Training: No need for handcrafted features.
- Multi-modal Fusion: Effectively fuses vision and state information.
Q: How to choose the right chunk_size?
A: chunk_size depends on task characteristics:
- Fast Tasks: chunk_size = 10-30.
- Medium Tasks: chunk_size = 50-100.
- Slow Tasks: chunk_size = 100-200.
- Generally, starting with 50 is recommended.
Q: How long does training take?
A: Training time depends on:
- Dataset Size: 100 episodes take ~4-8 hours (RTX 3070).
- Model Complexity: Larger models take longer.
- Hardware Configuration: Better GPUs significantly reduce training time.
- Convergence Requirement: Typically 50,000-100,000 steps.
Q: How to handle multi-camera data?
A: Suggestions for multi-camera processing:
- Camera Selection: Choose viewpoints with complementary information.
- Feature Fusion: Fuse at the feature level.
- Attention Mechanism: Let the model learn to focus on important viewpoints.
- Computing Resources: Be aware that more cameras increase computational load.
Q: How to improve model generalization?
A: Methods to improve generalization:
- Data Diversity: Collect data under varying conditions.
- Data Augmentation: Use image and action augmentation.
- Regularization: Appropriate weight decay and dropout.
- Domain Randomization: Use domain randomization in simulations.
- Multi-task Learning: Jointly train on multiple related tasks.
Related Resources
- ACT Original Paper
- LeRobot ACT Implementation
- Official ACT Code
- Robotics Learning Course
- LeRobot Documentation
Change Log
- 2024-01: Initial version released.
- 2024-02: Added multi-camera support.
- 2024-03: Optimized training efficiency and inference speed.
- 2024-04: Added model compression and deployment optimization.