
ACT (Action Chunking Transformer) Model Fine-tuning

Overview

ACT (Action Chunking Transformer) is an end-to-end imitation learning model designed for fine-grained manipulation tasks. By predicting chunks of actions rather than single steps, it mitigates the compounding-error problem common in traditional imitation learning, achieving high success rates in robot manipulation even on low-cost hardware.

Key Features

  • Action Chunking Prediction: Predicts multiple consecutive actions at once to reduce compounding errors.
  • Transformer Architecture: Utilizes attention mechanisms to process sequential data.
  • End-to-End Training: Predicts actions directly from raw observations.
  • High Success Rate: Performs exceptionally well on fine manipulation tasks.
  • Hardware Friendly: Capable of running on consumer-grade hardware.

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended) or macOS.
  • Python Version: 3.8+.
  • GPU: NVIDIA GPU (RTX 3070 or higher recommended) with at least 6GB VRAM.
  • Memory: At least 16GB RAM.
  • Storage: At least 30GB available space.

Environment Preparation

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment (venv recommended; conda is also fine)
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

# Install dependencies
pip install -e .

2. Install ACT-specific Dependencies

# Install additional packages
pip install einops
pip install timm
pip install wandb

# Login to Weights & Biases (optional)
wandb login
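
After installation, a quick import check confirms the environment is usable (a minimal sanity test; nothing here is ACT-specific):

# sanity_check.py — verify that LeRobot and CUDA are visible from Python
from importlib.metadata import version

import torch
import lerobot  # noqa: F401  (the import itself is the test)

print("lerobot:", version("lerobot"))
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())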

ACT Model Architecture

Core Components

  1. Vision Encoder: Processes multi-view image inputs.
  2. State Encoder: Processes robot state information.
  3. Transformer Decoder: Generates action sequences.
  4. Action Head: Outputs final action predictions.

Key Parameters

  • Chunk Size: Number of actions predicted at once (typically 50-100).
  • Context Length: Length of historical observations.
  • Hidden Dimension: Hidden dimension of the Transformer.
  • Number of Heads: Number of attention heads.
  • Number of Layers: Number of Transformer layers.
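
In LeRobot these knobs correspond one-to-one to the --policy.* flags used in the training commands below. As a hedged sketch, they can also be set in Python (the import path and field names mirror the CLI flags; verify them against your LeRobot version):

# act_config_sketch.py — constructing an ACT config in code instead of via CLI
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    chunk_size=100,       # actions predicted per forward pass
    n_action_steps=100,   # actions executed before re-predicting (<= chunk_size)
    n_obs_steps=1,        # past observations fed to the model
    dim_model=512,        # Transformer hidden dimension
    n_heads=8,            # attention heads
    n_encoder_layers=4,   # encoder depth
    n_decoder_layers=1,   # decoder depth
)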

Data Preparation

LeRobot Format Data

ACT requires datasets in LeRobot format. A typical layout looks like the following (exact filenames and nesting vary across LeRobot dataset versions):

your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   ├── observation.images.cam_left_wrist.png
│   │   ├── observation.images.cam_right_wrist.png
│   │   ├── observation.state.npy
│   │   ├── action.npy
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...

Data Quality Requirements

  • Minimum 50 episodes for basic training.
  • 200+ episodes recommended for optimal results.
  • Each episode should contain a complete task execution.
  • Multi-view images (at least 2 cameras).
  • High-quality action annotations.
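
Before training, it is worth sanity-checking the dataset against these requirements. A hedged sketch (property names such as num_episodes may differ across LeRobot versions):

# dataset_check.py — quick inspection of a LeRobot-format dataset
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-name/your_dataset")
print("episodes:", dataset.num_episodes)          # aim for 50+, ideally 200+
print("frames:", dataset.num_frames)
print("camera views:", dataset.meta.camera_keys)  # at least 2 recommended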

Fine-tuning Training

Important Parameter Constraint

The ACT model's n_action_steps must not exceed chunk_size. It is common to set both to the same value (e.g., 100), so that a full predicted chunk is executed before the next prediction; the sketch after this list illustrates the relationship.

  • chunk_size: Length of the action sequence the model predicts at once.
  • n_action_steps: Number of those actions actually executed before re-predicting.
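
The following toy sketch illustrates the relationship; every helper here is a stand-in, not LeRobot API:

# chunking_sketch.py — how chunk_size and n_action_steps interact
import numpy as np

CHUNK_SIZE = 100
N_ACTION_STEPS = 100  # must not exceed CHUNK_SIZE
ACTION_DIM = 7

def predict_chunk(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a policy forward pass; returns (CHUNK_SIZE, ACTION_DIM)."""
    return np.zeros((CHUNK_SIZE, ACTION_DIM))

obs = np.zeros(ACTION_DIM)
for _ in range(3):  # a few re-planning cycles
    actions = predict_chunk(obs)
    for action in actions[:N_ACTION_STEPS]:
        pass  # send `action` to the robot, then refresh `obs`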

Basic Training Command

# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0

# Start ACT training
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 8 \
  --steps 50000 \
  --output_dir outputs/train/act_finetuned \
  --job_name act_finetuning \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 1 \
  --policy.optimizer_lr 1e-5 \
  --policy.optimizer_weight_decay 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --save_freq 10000 \
  --wandb.enable true
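
Checkpoints are written under <output_dir>/checkpoints every save_freq steps, and the most recent one is available at checkpoints/last (the path used by the evaluation scripts later in this guide).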

Advanced Training Configurations

Multi-camera Configuration

# ACT training for multi-camera setups
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 4 \
  --steps 100000 \
  --output_dir outputs/train/act_multicam \
  --job_name act_multicam_training \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 2 \
  --policy.vision_backbone resnet18 \
  --policy.dim_model 512 \
  --policy.dim_feedforward 3200 \
  --policy.n_encoder_layers 4 \
  --policy.n_decoder_layers 1 \
  --policy.n_heads 8 \
  --policy.optimizer_lr 1e-5 \
  --policy.optimizer_weight_decay 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
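
Note that each additional camera view adds a full vision-backbone forward pass per observation step, so VRAM use and training time grow roughly linearly with the number of cameras; the smaller batch size above compensates for this.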

Memory-optimized Configuration

# For GPUs with smaller VRAM
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 2 \
  --steps 75000 \
  --output_dir outputs/train/act_memory_opt \
  --job_name act_memory_optimized \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 1 \
  --policy.vision_backbone resnet18 \
  --policy.dim_model 256 \
  --policy.optimizer_lr 1e-5 \
  --policy.use_amp true \
  --num_workers 2 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
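
If you still run out of VRAM, consider lowering the image resolution or chunk_size as well; use_amp typically cuts activation memory roughly in half at negligible accuracy cost.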

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.type | Policy type | act | ACT model type |
| --policy.pretrained_path | Pre-trained model path | lerobot/act | Official LeRobot ACT model to fine-tune from (optional) |
| --dataset.repo_id | Dataset repo ID | ${HF_USER}/your_dataset | Your Hugging Face dataset |
| --batch_size | Batch size | 8 | Adjust based on VRAM; 4-8 recommended on an RTX 3070 |
| --steps | Training steps | 50000 | 50k-100k steps recommended for fine manipulation tasks |
| --output_dir | Output directory | outputs/train/act_finetuned | Path where the model is saved |
| --job_name | Job name | act_finetuning | Used for logging and experiment tracking (optional) |

ACT-specific Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.chunk_size | Action chunk size | 100 | Number of actions predicted per forward pass |
| --policy.n_action_steps | Action steps to execute | 100 | Number of predicted actions actually executed |
| --policy.n_obs_steps | Observation steps | 1 | Number of historical observations |
| --policy.vision_backbone | Vision backbone | resnet18 | Network for image feature extraction |
| --policy.dim_model | Model dimension | 512 | Main Transformer dimension |
| --policy.dim_feedforward | Feedforward dimension | 3200 | Transformer feedforward layer dimension |
| --policy.n_encoder_layers | Encoder layers | 4 | Number of Transformer encoder layers |
| --policy.n_decoder_layers | Decoder layers | 1 | Number of Transformer decoder layers |
| --policy.n_heads | Attention heads | 8 | Number of multi-head attention heads |
| --policy.use_vae | Use VAE | true | Enables the CVAE training objective |
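
A note on use_vae: it enables the CVAE objective from the original ACT paper. During training, a latent variable is inferred from the demonstrated action sequence; at inference it is simply set to zero. This helps the model handle demonstrations that solve the same task in several different ways.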

Training Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.optimizer_lr | Learning rate | 1e-5 | ACT works best with small learning rates |
| --policy.optimizer_weight_decay | Weight decay | 1e-4 | Regularization strength (matches the commands above) |
| --policy.optimizer_lr_backbone | Backbone learning rate | 1e-5 | Vision encoder learning rate |
| --policy.use_amp | Mixed precision | true | Saves VRAM |
| --num_workers | Data loading workers | 4 | Adjust based on CPU cores |
| --policy.push_to_hub | Push to Hub | false | Upload the model to Hugging Face (requires repo_id) |
| --save_checkpoint | Save checkpoint | true | Save training checkpoints |
| --save_freq | Save frequency | 10000 | Checkpoint saving interval in steps |

Training Monitoring and Debugging

Weights & Biases Integration

# Detailed W&B configuration (append your other parameters before the final
# flag; a comment line must not follow a trailing backslash)
lerobot-train \
  --policy.type act \
  --dataset.repo_id your-name/your-dataset \
  --batch_size 8 \
  --steps 50000 \
  --policy.push_to_hub false \
  --wandb.enable true \
  --wandb.project act_experiments \
  --wandb.notes "ACT training with 4 cameras"

Key Metrics to Monitor

Metrics to focus on during training:

  • Total Loss: Overall loss, should decrease steadily.
  • Action Loss: Action prediction loss (L1/L2 loss).
  • Learning Rate: Learning rate curve.
  • Gradient Norm: Gradient norm, to catch gradient explosions (a helper for custom loops follows this list).
  • GPU Memory: VRAM usage.
  • Training Speed: Number of samples processed per second.
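
lerobot-train reports these metrics to W&B when --wandb.enable is set. If you instead run a custom training loop, a generic PyTorch helper like the following (not LeRobot API) computes the gradient norm mentioned above:

# grad_norm.py — global L2 gradient norm, for spotting gradient explosions
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5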

Training Log Analysis

# log_analysis.py
import wandb
import matplotlib.pyplot as plt

def analyze_training_logs(project_name, run_name):
    api = wandb.Api()
    # The run path is typically "<entity>/<project>/<run_id>"; a bare
    # "<project>/<run_name>" only works with a default entity configured.
    run = api.run(f"{project_name}/{run_name}")

    # Get training metrics as a DataFrame. Metric keys depend on what the
    # training script logs; adjust the 'train/...' names to match your run.
    history = run.history()

    # Plot loss curves
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 3, 1)
    plt.plot(history['_step'], history['train/total_loss'])
    plt.title('Total Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 2)
    plt.plot(history['_step'], history['train/action_loss'])
    plt.title('Action Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 3)
    plt.plot(history['_step'], history['train/learning_rate'])
    plt.title('Learning Rate')
    plt.xlabel('Step')
    plt.ylabel('LR')

    plt.tight_layout()
    plt.savefig('training_analysis.png')
    plt.show()

if __name__ == "__main__":
    analyze_training_logs("act_experiments", "act_v1_multicam")

Model Evaluation

Offline Evaluation

# offline_evaluation.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_act_model(model_path, dataset_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cuda")
    policy.eval()

    # Load test dataset (constructor arguments, e.g. `split`, may differ
    # across LeRobot versions)
    dataset = LeRobotDataset(dataset_path, split="test")

    total_l1_loss = 0.0
    total_l2_loss = 0.0
    num_samples = 0

    with torch.no_grad():
        for batch in dataset:
            # Model prediction (depending on your LeRobot version the policy
            # may expose select_action / predict_action_chunk instead of
            # returning a dict from __call__)
            prediction = policy(batch)

            # Calculate loss
            target_actions = batch['action']
            predicted_actions = prediction['action']

            l1_loss = torch.mean(torch.abs(predicted_actions - target_actions))
            l2_loss = torch.mean((predicted_actions - target_actions) ** 2)

            total_l1_loss += l1_loss.item()
            total_l2_loss += l2_loss.item()
            num_samples += 1

    avg_l1_loss = total_l1_loss / num_samples
    avg_l2_loss = total_l2_loss / num_samples

    print(f"Average L1 Loss: {avg_l1_loss:.4f}")
    print(f"Average L2 Loss: {avg_l2_loss:.4f}")

    return avg_l1_loss, avg_l2_loss

if __name__ == "__main__":
    model_path = "outputs/train/act_finetuned/checkpoints/last"
    dataset_path = "path/to/your/test/dataset"
    evaluate_act_model(model_path, dataset_path)

Online Evaluation (Robot Environment)

# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy

class ACTRobotController:
    def __init__(self, model_path, camera_names, device="cuda"):
        self.device = device
        self.policy = ACTPolicy.from_pretrained(model_path, device=device)
        self.policy.eval()
        self.camera_names = camera_names
        self.action_queue = []

    def get_action(self, observations):
        # If the action queue is empty, predict a new action chunk
        if len(self.action_queue) == 0:
            with torch.no_grad():
                # Build input
                batch = self.prepare_observation(observations)

                # Predict action chunk (depending on your LeRobot version the
                # policy may expose select_action / predict_action_chunk
                # instead of returning a dict from __call__)
                prediction = self.policy(batch)
                actions = prediction['action'].cpu().numpy()[0]  # [chunk_size, action_dim]

                # Add actions to the queue
                self.action_queue = list(actions)

        # Return the next action from the queue
        return self.action_queue.pop(0)

    def prepare_observation(self, observations):
        batch = {}

        # Process image observations
        for cam_name in self.camera_names:
            image_key = f"observation.images.{cam_name}"
            if image_key in observations:
                image = observations[image_key]
                # Preprocess image (normalize, resize, etc.)
                image_tensor = self.preprocess_image(image)
                batch[image_key] = image_tensor.unsqueeze(0).to(self.device)

        # Process state observations
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0).to(self.device)

        return batch

    def preprocess_image(self, image):
        # Image preprocessing logic; must match the training preprocessing
        image_tensor = torch.tensor(image).permute(2, 0, 1).float() / 255.0
        return image_tensor

# Example usage
if __name__ == "__main__":
    controller = ACTRobotController(
        model_path="outputs/train/act_finetuned/checkpoints/last",
        camera_names=["cam_high", "cam_low", "cam_left_wrist", "cam_right_wrist"]
    )

    # Simulated robot control loop with random placeholder observations
    for step in range(100):
        # Get the current observation (should come from the real robot)
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.images.cam_low": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.state": np.random.randn(7)
        }

        # Get action
        action = controller.get_action(observations)

        # Execute action (send to robot)
        print(f"Step {step}: Action = {action}")

        # Send the action to the real robot here
        # robot.execute_action(action)

Deployment and Optimization

Model Quantization

# quantization.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

def quantize_act_model(model_path, output_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cpu")
    policy.eval()

    # Dynamic quantization
    quantized_policy = torch.quantization.quantize_dynamic(
        policy,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    # Save quantized model
    torch.save(quantized_policy.state_dict(), output_path)
    print(f"Quantized model saved to {output_path}")

    return quantized_policy

if __name__ == "__main__":
    quantize_act_model(
        "outputs/train/act_finetuned/checkpoints/last",
        "outputs/act_quantized.pth"
    )
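
Dynamic quantization only rewrites the listed module types (here torch.nn.Linear) and targets CPU inference; the convolutional vision backbone stays in floating point. Measure both latency and task success before deploying a quantized policy.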

Inference Optimization

# optimized_inference.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

class OptimizedACTInference:
    def __init__(self, model_path, use_jit=False):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()

        if use_jit:
            # TorchScript optimization. Note: scripting can fail on complex
            # policies (dict inputs, data-dependent control flow); fall back
            # to eager mode if it does.
            self.policy = torch.jit.script(self.policy)

        # Warm up the model
        self.warmup()

    def warmup(self):
        # Warm up with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.state": torch.randn(1, 7, device="cuda")
        }

        with torch.no_grad():
            for _ in range(10):
                _ = self.policy(dummy_batch)

    @torch.no_grad()
    def predict(self, observations):
        # Fast inference
        prediction = self.policy(observations)
        return prediction['action'].cpu().numpy()

if __name__ == "__main__":
    inference = OptimizedACTInference(
        "outputs/train/act_finetuned/checkpoints/last"
    )

    # Test inference speed
    import time

    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }

    # CUDA kernels run asynchronously; synchronize for honest timings
    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(100):
        action = inference.predict(observations)
    torch.cuda.synchronize()
    end_time = time.time()

    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1 / avg_inference_time:.2f} Hz")

Best Practices

Data Collection Suggestions

  1. Multi-view Data: Use multiple cameras to capture rich visual information.
  2. High-quality Demonstrations: Ensure consistency and accuracy in demonstration data.
  3. Task Diversity: Include different starting states and target configurations.
  4. Failure Cases: Appropriately include failure cases to improve robustness.

Training Optimization Suggestions

  1. Chunk Size: Adjust chunk_size based on task complexity.
  2. Learning Rate Scheduler: Use cosine annealing or step decay (see the sketch after this list).
  3. Regularization: Use weight decay and dropout appropriately.
  4. Data Augmentation: Apply proper augmentation to images.
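
For item 2, if you train outside lerobot-train and manage optimization yourself, a cosine-annealing schedule is straightforward in plain PyTorch (the Linear module below is a stand-in for the policy):

# cosine_schedule_sketch.py — illustrative LR schedule for a custom loop
import torch

model = torch.nn.Linear(512, 7)  # stand-in for the ACT policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)

for step in range(50_000):
    # ... forward pass and loss.backward() go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # decay the LR once per training step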

Deployment Optimization Suggestions

  1. Model Compression: Use quantization and pruning techniques to reduce model size.
  2. Inference Acceleration: Use TensorRT or ONNX for inference optimization.
  3. Memory Management: Manage action queues and observation buffers efficiently.
  4. Real-time Guarantee: Ensure inference frequency meets control requirements.

FAQ

Q: What are the advantages of ACT compared to other imitation learning methods?

A: Key advantages include:

  • Reduced Compounding Errors: Predicting action chunks reduces error accumulation.
  • Improved Success Rates: Performs excellently on fine-grained manipulation tasks.
  • End-to-End Training: No need for handcrafted features.
  • Multi-modal Fusion: Effectively fuses vision and state information.

Q: How to choose the right chunk_size?

A: chunk_size depends on task characteristics:

  • Fast Tasks: chunk_size = 10-30.
  • Medium Tasks: chunk_size = 50-100.
  • Slow Tasks: chunk_size = 100-200.
  • Generally, starting with 50 is recommended.
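
As a concrete reference point: at a 50 Hz control frequency, chunk_size = 100 corresponds to two seconds of motion per prediction.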

Q: How long does training take?

A: Training time depends on:

  • Dataset Size: 100 episodes take ~4-8 hours (RTX 3070).
  • Model Complexity: Larger models take longer.
  • Hardware Configuration: Better GPUs significantly reduce training time.
  • Convergence Requirement: Typically 50,000-100,000 steps.

Q: How to handle multi-camera data?

A: Suggestions for multi-camera processing:

  • Camera Selection: Choose viewpoints with complementary information.
  • Feature Fusion: Fuse at the feature level.
  • Attention Mechanism: Let the model learn to focus on important viewpoints.
  • Computing Resources: Be aware that more cameras increase computational load.

Q: How to improve model generalization?

A: Methods to improve generalization:

  • Data Diversity: Collect data under varying conditions.
  • Data Augmentation: Use image and action augmentation.
  • Regularization: Appropriate weight decay and dropout.
  • Domain Randomization: Use domain randomization in simulations.
  • Multi-task Learning: Jointly train on multiple related tasks.

Change Log

  • 2024-01: Initial version released.
  • 2024-02: Added multi-camera support.
  • 2024-03: Optimized training efficiency and inference speed.
  • 2024-04: Added model compression and deployment optimization.