
ACT (Action Chunking Transformer) Model Fine-tuning

Overview

ACT (Action Chunking Transformer) is an end-to-end imitation learning model designed for fine-grained, dexterous manipulation tasks. By predicting chunks of future actions rather than a single action per step, it mitigates the compounding-error problem of traditional imitation learning and achieves high success rates on robot manipulation tasks, even on low-cost hardware.

Core Features

  • Action Chunking Prediction: Predicts multiple consecutive actions at once, reducing compounding errors
  • Transformer Architecture: Uses attention mechanisms to process sequential observation and action data
  • End-to-End Training: Directly predicts actions from raw observations
  • High Success Rate: Excels at dexterous manipulation tasks
  • Hardware Friendly: Can run on consumer-grade hardware

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
  • Python Version: 3.10 recommended (matches the environment created below)
  • GPU: NVIDIA GPU (RTX 3070 or higher recommended), at least 6GB VRAM
  • Memory: At least 16GB RAM
  • Storage: At least 30GB available space

Environment Setup

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot

# Install dependencies
pip install -e .

2. Install ACT-Specific Dependencies

# Install additional packages
pip install einops
pip install timm
pip install wandb

# Login to Weights & Biases (optional)
wandb login
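
After installation, it is worth running a quick sanity check before training. This is a minimal sketch (the filename is arbitrary); it only confirms that the package imports and that PyTorch can see your GPU:

# verify_setup.py - quick post-install sanity check (illustrative)
import torch
import lerobot  # raises ImportError if the editable install failed

print("lerobot imported from:", lerobot.__file__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))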

ACT Model Architecture

Core Components

  1. Vision Encoder: Processes multi-view image inputs
  2. State Encoder: Processes robot state information
  3. Transformer Decoder: Generates action sequences
  4. Action Head: Outputs final action predictions

Key Parameters

  • Chunk Size: Number of actions predicted at once (typically 50-100)
  • Context Length: Length of historical observations
  • Hidden Dimension: Transformer hidden dimension
  • Number of Heads: Number of attention heads
  • Number of Layers: Number of Transformer layers
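
These parameters correspond to fields of LeRobot's ACT configuration class. As a rough, illustrative sketch (field names mirror the CLI flags used later in this guide; the exact import path can vary between LeRobot versions):

# act_config_sketch.py - illustrative only; check field names against your LeRobot version
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    chunk_size=100,        # actions predicted per forward pass
    n_action_steps=100,    # predicted actions actually executed (must not exceed chunk_size)
    n_obs_steps=1,         # number of historical observations
    vision_backbone="resnet18",
    dim_model=512,         # Transformer hidden dimension
    n_heads=8,             # attention heads
    dim_feedforward=3200,
    n_encoder_layers=4,
    n_decoder_layers=1,
    use_vae=True,          # CVAE-style variational objective
)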

Data Preparation

LeRobot Format Data

ACT training uses datasets in the LeRobot format, which have the following structure:

your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   ├── observation.images.cam_left_wrist.png
│   │   ├── observation.images.cam_right_wrist.png
│   │   ├── observation.state.npy
│   │   ├── action.npy
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...

Data Quality Requirements

  • Minimum 50 episodes for basic training
  • 200+ episodes recommended for optimal results
  • Each episode should contain complete task execution
  • Multi-view images (at least 2 cameras)
  • High-quality action annotations
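
Before launching a long training run, it helps to sanity-check the dataset. The sketch below uses a placeholder repo_id; attribute names such as num_episodes, num_frames, and meta.camera_keys are taken from recent LeRobot versions and may differ in yours:

# inspect_dataset.py - sanity-check a LeRobot-format dataset (illustrative)
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-name/your_dataset")  # placeholder repo_id

print("episodes:", dataset.num_episodes)
print("frames:", dataset.num_frames)
print("camera keys:", dataset.meta.camera_keys)

# Inspect one sample to confirm the expected observation/action keys are present
sample = dataset[0]
print("state shape:", sample["observation.state"].shape)
print("action shape:", sample["action"].shape)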

Fine-tuning Training

Important Parameter Constraint

The ACT model's n_action_steps must not exceed chunk_size. A common choice is to set both to the same value (e.g., 100), so the policy is queried once per executed chunk; with chunk_size 100 and n_action_steps 50, for example, the model would replan halfway through each predicted chunk.

  • chunk_size: Length of action sequence predicted by the model at once
  • n_action_steps: Number of action steps actually executed

Basic Training Command

# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0

# Start ACT training
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 8 \
--steps 50000 \
--output_dir outputs/train/act_finetuned \
--job_name act_finetuning \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--save_freq 10000 \
--wandb.enable true
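
With this configuration, a checkpoint is written every 10000 steps under outputs/train/act_finetuned/checkpoints/; the most recent one is typically available at outputs/train/act_finetuned/checkpoints/last, which is the path used in the evaluation and deployment examples later in this guide.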

Advanced Training Configuration

Multi-Camera Configuration

# ACT training with multi-camera setup
lerobot-train \
--policy.type act \
--dataset.repo_id ${HF_USER}/your_dataset \
--batch_size 4 \
--steps 100000 \
--output_dir outputs/train/act_multicam \
--job_name act_multicam_training \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 2 \
--policy.vision_backbone resnet18 \
--policy.dim_model 512 \
--policy.dim_feedforward 3200 \
--policy.n_encoder_layers 4 \
--policy.n_decoder_layers 1 \
--policy.n_heads 8 \
--policy.optimizer_lr 1e-5 \
--policy.optimizer_weight_decay 1e-4 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Memory Optimization Configuration

# For GPUs with smaller VRAM
lerobot-train \
--policy.type act \
--dataset.repo_id io-ai-data/lerobot_data \
--batch_size 2 \
--steps 75000 \
--output_dir outputs/train/act_memory_opt \
--job_name act_memory_optimized \
--policy.device cuda \
--policy.chunk_size 100 \
--policy.n_action_steps 100 \
--policy.n_obs_steps 1 \
--policy.vision_backbone resnet18 \
--policy.dim_model 256 \
--policy.optimizer_lr 1e-5 \
--policy.use_amp true \
--num_workers 2 \
--policy.push_to_hub false \
--save_checkpoint true \
--wandb.enable true

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.type | Policy type | act | ACT model type |
| --policy.pretrained_path | Pretrained model path | lerobot/act | LeRobot official ACT model (optional) |
| --dataset.repo_id | Dataset repository ID | ${HF_USER}/dataset | Your HuggingFace dataset |
| --batch_size | Batch size | 8 | Adjust to fit VRAM; 4-8 recommended on an RTX 3070 |
| --steps | Training steps | 50000 | 50000-100000 steps recommended for dexterous tasks |
| --output_dir | Output directory | outputs/train/act_finetuned | Model save path |
| --job_name | Job name | act_finetuning | Used for logging and experiment tracking (optional) |

ACT-Specific Parameters

| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.chunk_size | Action chunk size | 100 | Number of actions predicted per forward pass |
| --policy.n_action_steps | Executed action steps | 100 | Number of predicted actions actually executed |
| --policy.n_obs_steps | Observation steps | 1 | Number of historical observations |
| --policy.vision_backbone | Vision backbone | resnet18 | Image feature extraction network |
| --policy.dim_model | Model dimension | 512 | Main Transformer hidden dimension |
| --policy.dim_feedforward | Feedforward dimension | 3200 | Transformer feedforward layer dimension |
| --policy.n_encoder_layers | Encoder layers | 4 | Number of Transformer encoder layers |
| --policy.n_decoder_layers | Decoder layers | 1 | Number of Transformer decoder layers |
| --policy.n_heads | Attention heads | 8 | Number of multi-head attention heads |
| --policy.use_vae | Use VAE | true | Enables the variational (CVAE) objective |

Training Parameters

| Parameter | Meaning | Recommended Value | Description |
|---|---|---|---|
| --policy.optimizer_lr | Learning rate | 1e-5 | ACT works best with relatively small learning rates |
| --policy.optimizer_weight_decay | Weight decay | 1e-4 | Regularization parameter |
| --policy.optimizer_lr_backbone | Backbone learning rate | 1e-5 | Vision encoder learning rate |
| --policy.use_amp | Mixed precision | true | Saves VRAM |
| --num_workers | Data loading workers | 4 | Adjust based on CPU core count |
| --policy.push_to_hub | Push to Hub | false | Whether to upload the model to HuggingFace (requires repo_id) |
| --save_checkpoint | Save checkpoints | true | Whether to save training checkpoints |
| --save_freq | Save frequency | 10000 | Checkpoint save interval (steps) |

Training Monitoring and Debugging

Weights & Biases Integration

# Detailed W&B configuration
lerobot-train \
--policy.type act \
--dataset.repo_id your-name/your-dataset \
--batch_size 8 \
--steps 50000 \
--policy.push_to_hub false \
--wandb.enable true \
--wandb.project act_experiments \
--wandb.notes "ACT training with 4 cameras" \
# ... other parameters

Key Metrics Monitoring

Metrics to monitor during training:

  • Total Loss: Overall loss, should steadily decrease
  • Action Loss: Action prediction loss (L1/L2 loss)
  • Learning Rate: Learning rate change curve
  • Gradient Norm: Gradient norm, used to spot exploding gradients (see the sketch after this list)
  • GPU Memory: VRAM usage
  • Training Speed: Samples processed per second
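
W&B records these metrics automatically when --wandb.enable is set. If you want to check the gradient norm yourself (for example, from a debugger or a custom loop), a minimal, generic PyTorch sketch is:

# grad_norm.py - global L2 gradient norm (generic PyTorch; call after loss.backward())
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5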

Training Log Analysis

# log_analysis.py
import wandb
import matplotlib.pyplot as plt

def analyze_training_logs(project_name, run_name):
    api = wandb.Api()
    run = api.run(f"{project_name}/{run_name}")

    # Get training metrics
    history = run.history()

    # Plot loss curves
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 3, 1)
    plt.plot(history['step'], history['train/total_loss'])
    plt.title('Total Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 2)
    plt.plot(history['step'], history['train/action_loss'])
    plt.title('Action Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 3)
    plt.plot(history['step'], history['train/learning_rate'])
    plt.title('Learning Rate')
    plt.xlabel('Step')
    plt.ylabel('LR')

    plt.tight_layout()
    plt.savefig('training_analysis.png')
    plt.show()

if __name__ == "__main__":
    analyze_training_logs("act_experiments", "act_v1_multicam")

Model Evaluation

Offline Evaluation

# offline_evaluation.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_act_model(model_path, dataset_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cuda")
    policy.eval()

    # Load test dataset
    dataset = LeRobotDataset(dataset_path, split="test")

    total_l1_loss = 0.0
    total_l2_loss = 0.0
    num_samples = 0

    with torch.no_grad():
        for sample in dataset:
            # Add a batch dimension and move tensors to the policy device
            batch = {
                k: v.unsqueeze(0).to("cuda")
                for k, v in sample.items()
                if isinstance(v, torch.Tensor)
            }

            # Model prediction
            prediction = policy(batch)

            # Calculate loss
            target_actions = batch['action']
            predicted_actions = prediction['action']

            l1_loss = torch.mean(torch.abs(predicted_actions - target_actions))
            l2_loss = torch.mean((predicted_actions - target_actions) ** 2)

            total_l1_loss += l1_loss.item()
            total_l2_loss += l2_loss.item()
            num_samples += 1

    avg_l1_loss = total_l1_loss / num_samples
    avg_l2_loss = total_l2_loss / num_samples

    print(f"Average L1 Loss: {avg_l1_loss:.4f}")
    print(f"Average L2 Loss: {avg_l2_loss:.4f}")

    return avg_l1_loss, avg_l2_loss

if __name__ == "__main__":
    model_path = "outputs/train/act_finetuned/checkpoints/last"
    dataset_path = "path/to/your/test/dataset"
    evaluate_act_model(model_path, dataset_path)

Online Evaluation (Robot Environment)

# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy

class ACTRobotController:
    def __init__(self, model_path, camera_names):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()
        self.camera_names = camera_names
        self.action_queue = []

    def get_action(self, observations):
        # If the action queue is empty, predict a new action chunk
        if len(self.action_queue) == 0:
            with torch.no_grad():
                # Build input
                batch = self.prepare_observation(observations)

                # Predict action chunk
                prediction = self.policy(batch)
                actions = prediction['action'].cpu().numpy()[0]  # [chunk_size, action_dim]

                # Add actions to the queue
                self.action_queue = list(actions)

        # Return the next action from the queue
        return self.action_queue.pop(0)

    def prepare_observation(self, observations):
        batch = {}

        # Process image observations
        for cam_name in self.camera_names:
            image_key = f"observation.images.{cam_name}"
            if image_key in observations:
                image = observations[image_key]
                # Preprocess image (normalize, resize, etc.)
                image_tensor = self.preprocess_image(image)
                batch[image_key] = image_tensor.unsqueeze(0).to("cuda")

        # Process state observations
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0).to("cuda")

        return batch

    def preprocess_image(self, image):
        # Image preprocessing logic
        # This must match the preprocessing used during training
        image_tensor = torch.tensor(image).permute(2, 0, 1).float() / 255.0
        return image_tensor

# Usage example
if __name__ == "__main__":
    controller = ACTRobotController(
        model_path="outputs/train/act_finetuned/checkpoints/last",
        camera_names=["cam_high", "cam_low", "cam_left_wrist", "cam_right_wrist"]
    )

    # Simulate a robot control loop
    for step in range(100):
        # Get the current observation (on a real system this comes from the robot)
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.images.cam_low": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.state": np.random.randn(7)
        }

        # Get action
        action = controller.get_action(observations)

        # Execute action (send it to the robot)
        print(f"Step {step}: Action = {action}")

        # This is where the action would be sent to the actual robot
        # robot.execute_action(action)

Deployment and Optimization

Model Quantization

# quantization.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

def quantize_act_model(model_path, output_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cpu")
    policy.eval()

    # Dynamic quantization
    quantized_policy = torch.quantization.quantize_dynamic(
        policy,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    # Save the quantized model
    torch.save(quantized_policy.state_dict(), output_path)
    print(f"Quantized model saved to {output_path}")

    return quantized_policy

if __name__ == "__main__":
    quantize_act_model(
        "outputs/train/act_finetuned/checkpoints/last",
        "outputs/act_quantized.pth"
    )

Inference Optimization

# optimized_inference.py
import time
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

class OptimizedACTInference:
    def __init__(self, model_path, use_jit=True):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()

        if use_jit:
            # Optimize with TorchScript
            self.policy = torch.jit.script(self.policy)

        # Warm up the model
        self.warmup()

    def warmup(self):
        # Warm up the model with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.state": torch.randn(1, 7, device="cuda")
        }

        with torch.no_grad():
            for _ in range(10):
                _ = self.policy(dummy_batch)

    @torch.no_grad()
    def predict(self, observations):
        # Fast inference
        prediction = self.policy(observations)
        return prediction['action'].cpu().numpy()

if __name__ == "__main__":
    inference = OptimizedACTInference(
        "outputs/train/act_finetuned/checkpoints/last"
    )

    # Test inference speed
    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }

    start_time = time.time()
    for _ in range(100):
        action = inference.predict(observations)
    end_time = time.time()

    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1 / avg_inference_time:.2f} Hz")

Best Practices

Data Collection Recommendations

  1. Multi-view Data: Use multiple cameras to obtain rich visual information
  2. High-quality Demonstrations: Ensure consistency and accuracy of demonstration data
  3. Task Diversity: Include different starting states and goal configurations
  4. Failure Cases: Appropriately include failure cases to improve robustness

Training Optimization Recommendations

  1. Action Chunk Size: Adjust chunk_size based on task complexity
  2. Learning Rate Scheduling: Use cosine annealing or step decay (see the sketch after this list)
  3. Regularization: Appropriately use weight decay and dropout
  4. Data Augmentation: Apply appropriate augmentation to images
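
For learning rate scheduling, if you drive the optimizer yourself rather than through lerobot-train, cosine annealing is built into PyTorch. The sketch below is generic (the model and step count are placeholders), not the scheduler lerobot-train uses internally:

# cosine_lr_sketch.py - generic cosine annealing schedule (placeholders throughout)
import torch

model = torch.nn.Linear(7, 7)  # placeholder for a policy network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)

for step in range(50_000):
    # ... forward pass, loss.backward(), optimizer.step(), optimizer.zero_grad() ...
    scheduler.step()  # anneal the learning rate once per training step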

Deployment Optimization Recommendations

  1. Model Compression: Use quantization and pruning techniques to reduce model size
  2. Inference Acceleration: Use TensorRT or ONNX for inference optimization
  3. Memory Management: Properly manage action queues and observation caches
  4. Real-time Guarantee: Ensure inference frequency meets control requirements

Frequently Asked Questions (FAQ)

Q: What advantages does ACT have compared to other imitation learning methods?

A: Main advantages of ACT include:

  • Reduced Compounding Error: Reduces error accumulation by predicting action chunks
  • Improved Success Rate: Excels at dexterous manipulation tasks
  • End-to-End Training: No need for hand-crafted features
  • Multimodal Fusion: Effectively fuses visual and state information

Q: How to choose the appropriate chunk_size?

A: The choice of chunk_size depends on task characteristics:

  • Fast tasks: chunk_size = 10-30
  • Medium tasks: chunk_size = 50-100
  • Slow tasks: chunk_size = 100-200
  • As a general starting point, try 50 and tune based on task performance

Q: How long does training take?

A: Training time depends on multiple factors:

  • Dataset size: 100 episodes take approximately 4-8 hours (RTX 3070)
  • Model complexity: Larger models require more time
  • Hardware configuration: Better GPUs can significantly reduce training time
  • Convergence requirement: Typically requires 50000-100000 steps

Q: How to handle multi-camera data?

A: Multi-camera processing recommendations:

  • Camera selection: Choose complementary viewpoints
  • Feature fusion: Fuse at the feature level
  • Attention mechanism: Let the model learn to focus on important viewpoints
  • Computing resources: Note that multi-camera increases computational burden

Q: How to improve model generalization?

A: Methods to improve generalization:

  • Data diversity: Collect data under different conditions
  • Data augmentation: Use image and action augmentation techniques
  • Regularization: Appropriate weight decay and dropout
  • Domain randomization: Use domain randomization techniques in simulation
  • Multi-task learning: Train jointly on multiple related tasks

Changelog

  • 2024-01: Initial version release
  • 2024-02: Added multi-camera support
  • 2024-03: Optimized training efficiency and inference speed
  • 2024-04: Added model compression and deployment optimization