
ACT (Action Chunking Transformer) Model Fine-tuning

Overview

ACT (Action Chunking Transformer) is an end-to-end imitation learning model designed for fine-grained manipulation tasks. By predicting chunks of actions rather than single steps, it mitigates the compounding-error problem common in traditional imitation learning, achieving high success rates in robot manipulation even on low-cost hardware.

Key Features

  • Action Chunking Prediction: Predicts multiple consecutive actions at once to reduce compounding errors.
  • Transformer Architecture: Utilizes attention mechanisms to process sequential data.
  • End-to-End Training: Predicts actions directly from raw observations.
  • High Success Rate: Performs exceptionally well on fine manipulation tasks.
  • Hardware Friendly: Capable of running on consumer-grade hardware.

Prerequisites

System Requirements

  • Operating System: Linux (Ubuntu 20.04+ recommended) or macOS.
  • Python Version: 3.8+.
  • GPU: NVIDIA GPU (RTX 3070 or higher recommended) with at least 6GB VRAM.
  • Memory: At least 16GB RAM.
  • Storage: At least 30GB available space.

Environment Preparation

1. Install LeRobot

# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot

# Create virtual environment (venv recommended; conda is also fine)
python -m venv .venv
source .venv/bin/activate
python -m pip install -U pip

# Install dependencies
pip install -e .

2. Install ACT-specific Dependencies

# Install additional packages
pip install einops
pip install timm
pip install wandb

# Login to Weights & Biases (optional)
wandb login
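
After installation, a quick import check confirms the environment is usable (a minimal sanity test; nothing here is ACT-specific):

# sanity_check.py — verify that LeRobot and CUDA are visible from Python
from importlib.metadata import version

import torch
import lerobot  # noqa: F401  (the import itself is the test)

print("lerobot:", version("lerobot"))
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())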

ACT Model Architecture

Core Components

  1. Vision Encoder: Processes multi-view image inputs.
  2. State Encoder: Processes robot state information.
  3. Transformer Decoder: Generates action sequences.
  4. Action Head: Outputs final action predictions.

Key Parameters

  • Chunk Size: Number of actions predicted at once (typically 50-100).
  • Context Length: Length of historical observations.
  • Hidden Dimension: Hidden dimension of the Transformer.
  • Number of Heads: Number of attention heads.
  • Number of Layers: Number of Transformer layers.
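
In LeRobot these knobs correspond one-to-one to the --policy.* flags used in the training commands below. As a hedged sketch, they can also be set in Python (the import path and field names mirror the CLI flags; verify them against your LeRobot version):

# act_config_sketch.py — constructing an ACT config in code instead of via CLI
from lerobot.policies.act.configuration_act import ACTConfig

config = ACTConfig(
    chunk_size=100,       # actions predicted per forward pass
    n_action_steps=100,   # actions executed before re-predicting (<= chunk_size)
    n_obs_steps=1,        # past observations fed to the model
    dim_model=512,        # Transformer hidden dimension
    n_heads=8,            # attention heads
    n_encoder_layers=4,   # encoder depth
    n_decoder_layers=1,   # decoder depth
)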

Data Preparation

LeRobot Format Data

ACT requires datasets in LeRobot format. A typical layout looks like the following (exact filenames and nesting vary across LeRobot dataset versions):

your_dataset/
├── data/
│   ├── chunk-001/
│   │   ├── observation.images.cam_high.png
│   │   ├── observation.images.cam_low.png
│   │   ├── observation.images.cam_left_wrist.png
│   │   ├── observation.images.cam_right_wrist.png
│   │   ├── observation.state.npy
│   │   ├── action.npy
│   │   └── ...
│   └── chunk-002/
│       └── ...
├── meta.json
├── stats.safetensors
└── videos/
    ├── episode_000000.mp4
    └── ...

Data Quality Requirements

  • Minimum 50 episodes for basic training.
  • 200+ episodes recommended for optimal results.
  • Each episode should contain a complete task execution.
  • Multi-view images (at least 2 cameras).
  • High-quality action annotations.
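
Before training, it is worth sanity-checking the dataset against these requirements. A hedged sketch (property names such as num_episodes may differ across LeRobot versions):

# dataset_check.py — quick inspection of a LeRobot-format dataset
from lerobot.datasets.lerobot_dataset import LeRobotDataset

dataset = LeRobotDataset("your-name/your_dataset")
print("episodes:", dataset.num_episodes)          # aim for 50+, ideally 200+
print("frames:", dataset.num_frames)
print("camera views:", dataset.meta.camera_keys)  # at least 2 recommended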

Fine-tuning Training

Important Parameter Constraint

The ACT model's n_action_steps must not exceed chunk_size. It is common to set both to the same value (e.g., 100), so that a full predicted chunk is executed before the next prediction; the sketch after this list illustrates the relationship.

  • chunk_size: Length of the action sequence the model predicts at once.
  • n_action_steps: Number of those actions actually executed before re-predicting.
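
The following toy sketch illustrates the relationship; every helper here is a stand-in, not LeRobot API:

# chunking_sketch.py — how chunk_size and n_action_steps interact
import numpy as np

CHUNK_SIZE = 100
N_ACTION_STEPS = 100  # must not exceed CHUNK_SIZE
ACTION_DIM = 7

def predict_chunk(obs: np.ndarray) -> np.ndarray:
    """Stand-in for a policy forward pass; returns (CHUNK_SIZE, ACTION_DIM)."""
    return np.zeros((CHUNK_SIZE, ACTION_DIM))

obs = np.zeros(ACTION_DIM)
for _ in range(3):  # a few re-planning cycles
    actions = predict_chunk(obs)
    for action in actions[:N_ACTION_STEPS]:
        pass  # send `action` to the robot, then refresh `obs`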

Basic Training Command

# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0

# Start ACT training
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 8 \
  --steps 50000 \
  --output_dir outputs/train/act_finetuned \
  --job_name act_finetuning \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 1 \
  --policy.optimizer_lr 1e-5 \
  --policy.optimizer_weight_decay 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --save_freq 10000 \
  --wandb.enable true
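
Checkpoints are written under <output_dir>/checkpoints every save_freq steps, and the most recent one is available at checkpoints/last (the path used by the evaluation scripts later in this guide).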

Advanced Training Configurations

Multi-camera Configuration

# ACT training for multi-camera setups
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 4 \
  --steps 100000 \
  --output_dir outputs/train/act_multicam \
  --job_name act_multicam_training \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 2 \
  --policy.vision_backbone resnet18 \
  --policy.dim_model 512 \
  --policy.dim_feedforward 3200 \
  --policy.n_encoder_layers 4 \
  --policy.n_decoder_layers 1 \
  --policy.n_heads 8 \
  --policy.optimizer_lr 1e-5 \
  --policy.optimizer_weight_decay 1e-4 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
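
Note that each additional camera view adds a full vision-backbone forward pass per observation step, so VRAM use and training time grow roughly linearly with the number of cameras; the smaller batch size above compensates for this.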

Memory-optimized Configuration

# For GPUs with smaller VRAM
lerobot-train \
  --policy.type act \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 2 \
  --steps 75000 \
  --output_dir outputs/train/act_memory_opt \
  --job_name act_memory_optimized \
  --policy.device cuda \
  --policy.chunk_size 100 \
  --policy.n_action_steps 100 \
  --policy.n_obs_steps 1 \
  --policy.vision_backbone resnet18 \
  --policy.dim_model 256 \
  --policy.optimizer_lr 1e-5 \
  --policy.use_amp true \
  --num_workers 2 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
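
If you still run out of VRAM, consider lowering the image resolution or chunk_size as well; use_amp typically cuts activation memory roughly in half at negligible accuracy cost.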

Parameter Details

Core Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.type | Policy type | act | ACT model type |
| --policy.pretrained_path | Pre-trained model path | lerobot/act | Official LeRobot ACT model to fine-tune from (optional) |
| --dataset.repo_id | Dataset repo ID | ${HF_USER}/your_dataset | Your Hugging Face dataset |
| --batch_size | Batch size | 8 | Adjust based on VRAM; 4-8 recommended on an RTX 3070 |
| --steps | Training steps | 50000 | 50k-100k steps recommended for fine manipulation tasks |
| --output_dir | Output directory | outputs/train/act_finetuned | Path where the model is saved |
| --job_name | Job name | act_finetuning | Used for logging and experiment tracking (optional) |

ACT-specific Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.chunk_size | Action chunk size | 100 | Number of actions predicted per forward pass |
| --policy.n_action_steps | Action steps to execute | 100 | Number of predicted actions actually executed |
| --policy.n_obs_steps | Observation steps | 1 | Number of historical observations |
| --policy.vision_backbone | Vision backbone | resnet18 | Network for image feature extraction |
| --policy.dim_model | Model dimension | 512 | Main Transformer dimension |
| --policy.dim_feedforward | Feedforward dimension | 3200 | Transformer feedforward layer dimension |
| --policy.n_encoder_layers | Encoder layers | 4 | Number of Transformer encoder layers |
| --policy.n_decoder_layers | Decoder layers | 1 | Number of Transformer decoder layers |
| --policy.n_heads | Attention heads | 8 | Number of multi-head attention heads |
| --policy.use_vae | Use VAE | true | Enables the CVAE training objective |
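
A note on use_vae: it enables the CVAE objective from the original ACT paper. During training, a latent variable is inferred from the demonstrated action sequence; at inference it is simply set to zero. This helps the model handle demonstrations that solve the same task in several different ways.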

Training Parameters

| Parameter | Meaning | Recommended Value | Description |
| --- | --- | --- | --- |
| --policy.optimizer_lr | Learning rate | 1e-5 | ACT works best with small learning rates |
| --policy.optimizer_weight_decay | Weight decay | 1e-4 | Regularization strength (matches the commands above) |
| --policy.optimizer_lr_backbone | Backbone learning rate | 1e-5 | Vision encoder learning rate |
| --policy.use_amp | Mixed precision | true | Saves VRAM |
| --num_workers | Data loading workers | 4 | Adjust based on CPU cores |
| --policy.push_to_hub | Push to Hub | false | Upload the model to Hugging Face (requires repo_id) |
| --save_checkpoint | Save checkpoint | true | Save training checkpoints |
| --save_freq | Save frequency | 10000 | Checkpoint saving interval in steps |

Training Monitoring and Debugging

Weights & Biases Integration

# Detailed W&B configuration (append your other parameters before the final
# flag; a comment line must not follow a trailing backslash)
lerobot-train \
  --policy.type act \
  --dataset.repo_id your-name/your-dataset \
  --batch_size 8 \
  --steps 50000 \
  --policy.push_to_hub false \
  --wandb.enable true \
  --wandb.project act_experiments \
  --wandb.notes "ACT training with 4 cameras"

Key Metrics to Monitor

Metrics to focus on during training:

  • Total Loss: Overall loss, should decrease steadily.
  • Action Loss: Action prediction loss (L1/L2 loss).
  • Learning Rate: Learning rate curve.
  • Gradient Norm: Gradient norm, to catch gradient explosions (a helper for custom loops follows this list).
  • GPU Memory: VRAM usage.
  • Training Speed: Number of samples processed per second.
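
lerobot-train reports these metrics to W&B when --wandb.enable is set. If you instead run a custom training loop, a generic PyTorch helper like the following (not LeRobot API) computes the gradient norm mentioned above:

# grad_norm.py — global L2 gradient norm, for spotting gradient explosions
import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().norm(2).item() ** 2
    return total ** 0.5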

Training Log Analysis

# log_analysis.py
import wandb
import matplotlib.pyplot as plt

def analyze_training_logs(project_name, run_name):
    api = wandb.Api()
    # The run path is typically "<entity>/<project>/<run_id>"; a bare
    # "<project>/<run_name>" only works with a default entity configured.
    run = api.run(f"{project_name}/{run_name}")

    # Get training metrics as a DataFrame. Metric keys depend on what the
    # training script logs; adjust the 'train/...' names to match your run.
    history = run.history()

    # Plot loss curves
    plt.figure(figsize=(12, 4))

    plt.subplot(1, 3, 1)
    plt.plot(history['_step'], history['train/total_loss'])
    plt.title('Total Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 2)
    plt.plot(history['_step'], history['train/action_loss'])
    plt.title('Action Loss')
    plt.xlabel('Step')
    plt.ylabel('Loss')

    plt.subplot(1, 3, 3)
    plt.plot(history['_step'], history['train/learning_rate'])
    plt.title('Learning Rate')
    plt.xlabel('Step')
    plt.ylabel('LR')

    plt.tight_layout()
    plt.savefig('training_analysis.png')
    plt.show()

if __name__ == "__main__":
    analyze_training_logs("act_experiments", "act_v1_multicam")

Model Evaluation

Offline Evaluation

# offline_evaluation.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset

def evaluate_act_model(model_path, dataset_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cuda")
    policy.eval()

    # Load test dataset (constructor arguments, e.g. `split`, may differ
    # across LeRobot versions)
    dataset = LeRobotDataset(dataset_path, split="test")

    total_l1_loss = 0.0
    total_l2_loss = 0.0
    num_samples = 0

    with torch.no_grad():
        for batch in dataset:
            # Model prediction (depending on your LeRobot version the policy
            # may expose select_action / predict_action_chunk instead of
            # returning a dict from __call__)
            prediction = policy(batch)

            # Calculate loss
            target_actions = batch['action']
            predicted_actions = prediction['action']

            l1_loss = torch.mean(torch.abs(predicted_actions - target_actions))
            l2_loss = torch.mean((predicted_actions - target_actions) ** 2)

            total_l1_loss += l1_loss.item()
            total_l2_loss += l2_loss.item()
            num_samples += 1

    avg_l1_loss = total_l1_loss / num_samples
    avg_l2_loss = total_l2_loss / num_samples

    print(f"Average L1 Loss: {avg_l1_loss:.4f}")
    print(f"Average L2 Loss: {avg_l2_loss:.4f}")

    return avg_l1_loss, avg_l2_loss

if __name__ == "__main__":
    model_path = "outputs/train/act_finetuned/checkpoints/last"
    dataset_path = "path/to/your/test/dataset"
    evaluate_act_model(model_path, dataset_path)

Online Evaluation (Robot Environment)

# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.act.modeling_act import ACTPolicy

class ACTRobotController:
    def __init__(self, model_path, camera_names, device="cuda"):
        self.device = device
        self.policy = ACTPolicy.from_pretrained(model_path, device=device)
        self.policy.eval()
        self.camera_names = camera_names
        self.action_queue = []

    def get_action(self, observations):
        # If the action queue is empty, predict a new action chunk
        if len(self.action_queue) == 0:
            with torch.no_grad():
                # Build input
                batch = self.prepare_observation(observations)

                # Predict action chunk (depending on your LeRobot version the
                # policy may expose select_action / predict_action_chunk
                # instead of returning a dict from __call__)
                prediction = self.policy(batch)
                actions = prediction['action'].cpu().numpy()[0]  # [chunk_size, action_dim]

                # Add actions to the queue
                self.action_queue = list(actions)

        # Return the next action from the queue
        return self.action_queue.pop(0)

    def prepare_observation(self, observations):
        batch = {}

        # Process image observations
        for cam_name in self.camera_names:
            image_key = f"observation.images.{cam_name}"
            if image_key in observations:
                image = observations[image_key]
                # Preprocess image (normalize, resize, etc.)
                image_tensor = self.preprocess_image(image)
                batch[image_key] = image_tensor.unsqueeze(0).to(self.device)

        # Process state observations
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0).to(self.device)

        return batch

    def preprocess_image(self, image):
        # Image preprocessing logic; must match the training preprocessing
        image_tensor = torch.tensor(image).permute(2, 0, 1).float() / 255.0
        return image_tensor

# Example usage
if __name__ == "__main__":
    controller = ACTRobotController(
        model_path="outputs/train/act_finetuned/checkpoints/last",
        camera_names=["cam_high", "cam_low", "cam_left_wrist", "cam_right_wrist"]
    )

    # Simulated robot control loop with random placeholder observations
    for step in range(100):
        # Get the current observation (should come from the real robot)
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.images.cam_low": np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8),
            "observation.state": np.random.randn(7)
        }

        # Get action
        action = controller.get_action(observations)

        # Execute action (send to robot)
        print(f"Step {step}: Action = {action}")

        # Send the action to the real robot here
        # robot.execute_action(action)

Deployment and Optimization

Model Quantization

# quantization.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

def quantize_act_model(model_path, output_path):
    # Load model
    policy = ACTPolicy.from_pretrained(model_path, device="cpu")
    policy.eval()

    # Dynamic quantization
    quantized_policy = torch.quantization.quantize_dynamic(
        policy,
        {torch.nn.Linear},
        dtype=torch.qint8
    )

    # Save quantized model
    torch.save(quantized_policy.state_dict(), output_path)
    print(f"Quantized model saved to {output_path}")

    return quantized_policy

if __name__ == "__main__":
    quantize_act_model(
        "outputs/train/act_finetuned/checkpoints/last",
        "outputs/act_quantized.pth"
    )
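
Dynamic quantization only rewrites the listed module types (here torch.nn.Linear) and targets CPU inference; the convolutional vision backbone stays in floating point. Measure both latency and task success before deploying a quantized policy.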

Inference Optimization

# optimized_inference.py
import torch
from lerobot.policies.act.modeling_act import ACTPolicy

class OptimizedACTInference:
    def __init__(self, model_path, use_jit=False):
        self.policy = ACTPolicy.from_pretrained(model_path, device="cuda")
        self.policy.eval()

        if use_jit:
            # TorchScript optimization. Note: scripting can fail on complex
            # policies (dict inputs, data-dependent control flow); fall back
            # to eager mode if it does.
            self.policy = torch.jit.script(self.policy)

        # Warm up the model
        self.warmup()

    def warmup(self):
        # Warm up with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
            "observation.state": torch.randn(1, 7, device="cuda")
        }

        with torch.no_grad():
            for _ in range(10):
                _ = self.policy(dummy_batch)

    @torch.no_grad()
    def predict(self, observations):
        # Fast inference
        prediction = self.policy(observations)
        return prediction['action'].cpu().numpy()

if __name__ == "__main__":
    inference = OptimizedACTInference(
        "outputs/train/act_finetuned/checkpoints/last"
    )

    # Test inference speed
    import time

    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.images.cam_low": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }

    # CUDA kernels run asynchronously; synchronize for honest timings
    torch.cuda.synchronize()
    start_time = time.time()
    for _ in range(100):
        action = inference.predict(observations)
    torch.cuda.synchronize()
    end_time = time.time()

    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1 / avg_inference_time:.2f} Hz")

Best Practices

Data Collection Suggestions

  1. Multi-view Data: Use multiple cameras to capture rich visual information.
  2. High-quality Demonstrations: Ensure consistency and accuracy in demonstration data.
  3. Task Diversity: Include different starting states and target configurations.
  4. Failure Cases: Appropriately include failure cases to improve robustness.

Training Optimization Suggestions

  1. Chunk Size: Adjust chunk_size based on task complexity.
  2. Learning Rate Scheduler: Use cosine annealing or step decay (see the sketch after this list).
  3. Regularization: Use weight decay and dropout appropriately.
  4. Data Augmentation: Apply proper augmentation to images.
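
For item 2, if you train outside lerobot-train and manage optimization yourself, a cosine-annealing schedule is straightforward in plain PyTorch (the Linear module below is a stand-in for the policy):

# cosine_schedule_sketch.py — illustrative LR schedule for a custom loop
import torch

model = torch.nn.Linear(512, 7)  # stand-in for the ACT policy
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50_000)

for step in range(50_000):
    # ... forward pass and loss.backward() go here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # decay the LR once per training step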

Deployment Optimization Suggestions

  1. Model Compression: Use quantization and pruning techniques to reduce model size.
  2. Inference Acceleration: Use TensorRT or ONNX for inference optimization.
  3. Memory Management: Manage action queues and observation buffers efficiently.
  4. Real-time Guarantee: Ensure inference frequency meets control requirements.

FAQ

Q: What are the advantages of ACT compared to other imitation learning methods?

A: Key advantages include:

  • Reduced Compounding Errors: Predicting action chunks reduces error accumulation.
  • Improved Success Rates: Performs excellently on fine-grained manipulation tasks.
  • End-to-End Training: No need for handcrafted features.
  • Multi-modal Fusion: Effectively fuses vision and state information.

Q: How to choose the right chunk_size?

A: chunk_size depends on task characteristics:

  • Fast Tasks: chunk_size = 10-30.
  • Medium Tasks: chunk_size = 50-100.
  • Slow Tasks: chunk_size = 100-200.
  • Generally, starting with 50 is recommended.
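
As a concrete reference point: at a 50 Hz control frequency, chunk_size = 100 corresponds to two seconds of motion per prediction.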

Q: How long does training take?

A: Training time depends on:

  • Dataset Size: 100 episodes take ~4-8 hours (RTX 3070).
  • Model Complexity: Larger models take longer.
  • Hardware Configuration: Better GPUs significantly reduce training time.
  • Convergence Requirement: Typically 50,000-100,000 steps.

Q: How to handle multi-camera data?

A: Suggestions for multi-camera processing:

  • Camera Selection: Choose viewpoints with complementary information.
  • Feature Fusion: Fuse at the feature level.
  • Attention Mechanism: Let the model learn to focus on important viewpoints.
  • Computing Resources: Be aware that more cameras increase computational load.

Q: How to improve model generalization?

A: Methods to improve generalization:

  • Data Diversity: Collect data under varying conditions.
  • Data Augmentation: Use image and action augmentation.
  • Regularization: Appropriate weight decay and dropout.
  • Domain Randomization: Use domain randomization in simulations.
  • Multi-task Learning: Jointly train on multiple related tasks.

Change Log

  • 2024-01: Initial version released.
  • 2024-02: Added multi-camera support.
  • 2024-03: Optimized training efficiency and inference speed.
  • 2024-04: Added model compression and deployment optimization.