Diffusion Policy Model Fine-tuning
Overview
Diffusion Policy is a visuomotor policy learning method that applies the generative capabilities of diffusion models to robot control. Rather than regressing a single action, it learns to denoise action sequences conditioned on observations, which lets it generate diverse, high-quality trajectories and perform well on complex robot manipulation tasks.
Core Features
- Diffusion Generation: Uses a diffusion model to generate continuous action sequences
 - Multimodal Actions: Handles tasks that admit multiple valid solutions
 - High-Quality Output: Generates smooth, natural robot actions
 - Robustness: Tolerant of noise and perturbations in the observations
 - Expressiveness: Can model complex action distributions
 
Prerequisites
System Requirements
- Operating System: Linux (Ubuntu 20.04+ recommended) or macOS
 - Python Version: 3.10+ (matching the conda environment created below)
 - GPU: NVIDIA GPU (RTX 3080 or higher recommended), at least 10GB VRAM
 - Memory: At least 32GB RAM
 - Storage: At least 50GB available space
 
Environment Setup
1. Install LeRobot
# Clone LeRobot repository
git clone https://github.com/huggingface/lerobot.git
cd lerobot
# Create virtual environment
conda create -n lerobot python=3.10
conda activate lerobot
# Install dependencies
pip install -e .
2. Install Diffusion Policy-Specific Dependencies
# Install diffusion model related dependencies
pip install diffusers
pip install accelerate
pip install transformers
pip install einops
pip install wandb
# Install numerical computing libraries
pip install scipy
pip install scikit-learn
# Login to Weights & Biases (optional)
wandb login
Diffusion Policy Architecture
Core Components
- Vision Encoder: Extracts image features
 - State Encoder: Processes robot state information
 - Conditional Encoder: Fuses vision and state information
 - Diffusion Network: Learns the diffusion process of action distributions
 - Noise Scheduler: Controls noise levels in the diffusion process
 
Diffusion Process
- Forward Process: Gradually adds noise to action sequences
 - Reverse Process: Gradually recovers action sequences from noise
 - Conditional Generation: Generates actions based on observation conditions
 - Sampling Strategy: Uses DDPM or DDIM sampling (a minimal sketch of the noising step and training objective follows below)
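Below is a minimal, self-contained sketch of the forward (noising) step and the epsilon-prediction training objective, using the diffusers DDPMScheduler. The tensor shapes and the stand-in denoising network are illustrative assumptions, not LeRobot internals; in the real policy the denoiser is a conditional 1D U-Net conditioned on encoded observations.
# diffusion_process_sketch.py
import torch
import torch.nn.functional as F
from diffusers import DDPMScheduler

# Illustrative shapes (assumptions): batch of 8 action chunks, horizon 16, 7-dim actions
actions = torch.randn(8, 16, 7)

scheduler = DDPMScheduler(num_train_timesteps=100, beta_schedule="squaredcos_cap_v2")

# Forward process: corrupt the clean action chunks with noise at random timesteps
noise = torch.randn_like(actions)
timesteps = torch.randint(0, scheduler.config.num_train_timesteps, (actions.shape[0],))
noisy_actions = scheduler.add_noise(actions, noise, timesteps)

# Training objective ("epsilon" prediction): a network predicts the added noise.
# A linear layer stands in here for the observation-conditioned U-Net.
denoiser = torch.nn.Linear(7, 7)
noise_pred = denoiser(noisy_actions)
loss = F.mse_loss(noise_pred, noise)
print(f"toy denoising loss: {loss.item():.4f}")

# At inference time the reverse process starts from pure noise and repeatedly calls
# scheduler.step(...) with the predicted noise, conditioned on the current observation.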
 
Data Preparation
LeRobot Format Data
Diffusion Policy training expects datasets in the LeRobot format. The layout below shows the v2.x parquet-based layout (details vary with the dataset format version); each frame stores the camera images, observation.state, and action used during training. A short loading example follows the tree.
your_dataset/
├── data/
│   ├── chunk-000/
│   │   ├── episode_000000.parquet
│   │   ├── episode_000001.parquet
│   │   └── ...
│   └── chunk-001/
│       └── ...
├── meta/
│   ├── info.json
│   ├── stats.json
│   ├── episodes.jsonl
│   └── tasks.jsonl
└── videos/
    └── chunk-000/
        └── observation.images.cam_high/
            ├── episode_000000.mp4
            └── ...
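As a quick sanity check, the dataset can be loaded in Python and asked for the temporal windows the policy trains on. The repo id, fps, and window sizes below are placeholders; delta_timestamps is the standard LeRobotDataset mechanism for retrieving stacked observations and action chunks (the exact offsets used in training are derived from the policy configuration).
# dataset_sanity_check.py
from lerobot.datasets.lerobot_dataset import LeRobotDataset

fps = 30  # assumption: the recording frequency of your dataset
delta_timestamps = {
    # 2 observation steps and a 16-step action horizon, matching the defaults used below
    "observation.images.cam_high": [-1 / fps, 0.0],
    "observation.state": [-1 / fps, 0.0],
    "action": [i / fps for i in range(16)],
}

dataset = LeRobotDataset("your-name/your_dataset", delta_timestamps=delta_timestamps)
sample = dataset[0]
print(sample["observation.state"].shape)  # (2, state_dim)
print(sample["action"].shape)             # (16, action_dim)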
Data Quality Requirements
- Minimum 100 episodes for basic training
 - 500+ episodes recommended for optimal results
 - Action sequences should be smooth and continuous
 - Include diverse task scenarios
 - High-quality visual observation data
 
Fine-tuning Training
Basic Training Command
# Set environment variables
export HF_USER="your-huggingface-username"
export CUDA_VISIBLE_DEVICES=0
# Start Diffusion Policy training
lerobot-train \
  --policy.type diffusion \
  --policy.pretrained_path lerobot/diffusion_policy \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 64 \
  --steps 100000 \
  --output_dir outputs/train/diffusion_policy_finetuned \
  --job_name diffusion_policy_finetuning \
  --policy.device cuda \
  --policy.horizon 16 \
  --policy.n_action_steps 8 \
  --policy.n_obs_steps 2 \
  --policy.num_inference_steps 100 \
  --policy.optimizer_lr 1e-4 \
  --policy.optimizer_weight_decay 1e-6 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --save_freq 10000 \
  --wandb.enable true
Advanced Training Configuration
Multi-Step Prediction Configuration
# Configuration for long sequence prediction
lerobot-train \
  --policy.type diffusion \
  --policy.pretrained_path lerobot/diffusion_policy \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 32 \
  --steps 150000 \
  --output_dir outputs/train/diffusion_policy_long_horizon \
  --job_name diffusion_policy_long_horizon \
  --policy.device cuda \
  --policy.horizon 32 \
  --policy.n_action_steps 16 \
  --policy.n_obs_steps 4 \
  --policy.num_inference_steps 100 \
  --policy.beta_schedule squaredcos_cap_v2 \
  --policy.clip_sample true \
  --policy.prediction_type epsilon \
  --policy.optimizer_lr 1e-4 \
  --policy.scheduler_name cosine \
  --policy.scheduler_warmup_steps 5000 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
Memory Optimization Configuration
# For GPUs with smaller VRAM
lerobot-train \
  --policy.type diffusion \
  --policy.pretrained_path lerobot/diffusion_policy \
  --dataset.repo_id ${HF_USER}/your_dataset \
  --batch_size 16 \
  --steps 200000 \
  --output_dir outputs/train/diffusion_policy_memory_opt \
  --job_name diffusion_policy_memory_optimized \
  --policy.device cuda \
  --policy.horizon 16 \
  --policy.n_action_steps 8 \
  --policy.num_inference_steps 50 \
  --policy.optimizer_lr 5e-5 \
  --policy.use_amp true \
  --num_workers 2 \
  --policy.push_to_hub false \
  --save_checkpoint true \
  --wandb.enable true
Parameter Details
Core Parameters
| Parameter | Meaning | Recommended Value | Description | 
|---|---|---|---|
--policy.type | Policy type | diffusion | Diffusion Policy model type | 
--policy.pretrained_path | Pretrained model path | lerobot/diffusion_policy | LeRobot official model (optional) | 
--dataset.repo_id | Dataset repository ID | ${HF_USER}/dataset | Your HuggingFace dataset | 
--batch_size | Batch size | 64 | Adjust based on VRAM, RTX 3080 recommended 32-64 | 
--steps | Training steps | 100000 | Diffusion models typically require more training steps | 
--output_dir | Output directory | outputs/train/diffusion_policy_finetuned | Model save path | 
--job_name | Job name | diffusion_policy_finetuning | For logging and experiment tracking (optional) | 
Diffusion Policy-Specific Parameters
| Parameter | Meaning | Recommended Value | Description | 
|---|---|---|---|
--policy.horizon | Prediction horizon | 16 | Length of the predicted action sequence, in control steps | 
--policy.n_action_steps | Executed action steps | 8 | Number of predicted actions executed before re-planning | 
--policy.n_obs_steps | Observation steps | 2 | Number of past observations the policy conditions on | 
--policy.num_inference_steps | Inference steps | 100 | Number of reverse-diffusion sampling steps (used only at inference time, not during training) | 
--policy.beta_schedule | Noise schedule | squaredcos_cap_v2 | Schedule for adding noise in the forward process | 
--policy.clip_sample | Sample clipping | true | Whether to clip generated samples | 
--policy.clip_sample_range | Clipping range | 1.0 | Range used when clip_sample is enabled | 
--policy.prediction_type | Prediction type | epsilon | Whether the network predicts the added noise (epsilon) or the denoised sample | 
--policy.num_train_timesteps | Training timesteps | 100 | Number of forward-diffusion timesteps used during training | 
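To make the relationship between horizon, n_action_steps, and n_obs_steps concrete, the toy sketch below walks through one receding-horizon control cycle with the default values. The index convention is simplified for illustration; the exact alignment of the predicted window relative to the current step is handled inside the LeRobot implementation.
# receding_horizon_sketch.py
n_obs_steps, horizon, n_action_steps = 2, 16, 8

current_step = 100  # an arbitrary control step (assumption)
observed_steps = list(range(current_step - n_obs_steps + 1, current_step + 1))  # 99, 100
predicted_steps = list(range(current_step, current_step + horizon))             # 100..115
executed_steps = predicted_steps[:n_action_steps]                               # 100..107

print(f"condition on observations from steps {observed_steps}")
print(f"predict a {horizon}-step action chunk for steps {predicted_steps[0]}..{predicted_steps[-1]}")
print(f"execute the first {n_action_steps} actions ({executed_steps[0]}..{executed_steps[-1]}), then re-plan")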
Network Architecture Parameters
| Parameter | Meaning | Recommended Value | Description | 
|---|---|---|---|
--policy.vision_backbone | Vision backbone | resnet18 | Image feature extraction network | 
--policy.crop_shape | Image crop size | 84 84 | Crop size for input images | 
--policy.crop_is_random | Random cropping | true | Whether to randomly crop during training | 
--policy.use_group_norm | Use group normalization | true | Replace batch normalization | 
--policy.spatial_softmax_num_keypoints | Spatial softmax keypoints | 32 | Number of keypoints in spatial softmax layer | 
--policy.down_dims | Downsampling dimensions | 512 1024 2048 | Dimensions of U-Net downsampling path | 
--policy.kernel_size | Convolution kernel size | 5 | Kernel size for 1D convolution | 
--policy.n_groups | Group normalization groups | 8 | Number of groups in GroupNorm | 
--policy.diffusion_step_embed_dim | Step embedding dimension | 128 | Embedding dimension for diffusion steps | 
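The same architecture settings can be set in Python through the policy's configuration class; the field names mirror the CLI flags above. The import path below matches recent LeRobot releases and may differ in older versions.
# diffusion_config_sketch.py
from lerobot.policies.diffusion.configuration_diffusion import DiffusionConfig

config = DiffusionConfig(
    # observation / action windows
    n_obs_steps=2,
    horizon=16,
    n_action_steps=8,
    # vision encoder
    vision_backbone="resnet18",
    crop_shape=(84, 84),
    crop_is_random=True,
    use_group_norm=True,
    spatial_softmax_num_keypoints=32,
    # 1D U-Net denoiser
    down_dims=(512, 1024, 2048),
    kernel_size=5,
    n_groups=8,
    diffusion_step_embed_dim=128,
)
print(config)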
Training Parameters
| Parameter | Meaning | Recommended Value | Description | 
|---|---|---|---|
--policy.optimizer_lr | Learning rate | 1e-4 | Recommended learning rate for diffusion models | 
--policy.optimizer_weight_decay | Weight decay | 1e-6 | Regularization parameter | 
--policy.optimizer_betas | Adam optimizer beta | 0.95 0.999 | Beta parameters for Adam optimizer | 
--policy.optimizer_eps | Adam epsilon | 1e-8 | Numerical stability parameter | 
--policy.scheduler_name | Learning rate scheduler | cosine | Cosine annealing schedule | 
--policy.scheduler_warmup_steps | Warmup steps | 500 | Learning rate warmup | 
--policy.use_amp | Mixed precision | true | Saves VRAM | 
--num_workers | Data loading threads | 4 | Adjust based on CPU core count | 
--policy.push_to_hub | Push to Hub | false | Whether to upload model to HuggingFace (requires repo_id) | 
--save_checkpoint | Save checkpoints | true | Whether to save training checkpoints | 
--save_freq | Save frequency | 10000 | Checkpoint save interval | 
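lerobot-train constructs the optimizer and learning-rate scheduler from these flags automatically. Purely for illustration, the sketch below builds equivalent objects by hand with torch and the diffusers scheduler helper, using a stand-in module in place of the policy network.
# optimizer_sketch.py
import torch
from diffusers.optimization import get_scheduler

model = torch.nn.Linear(10, 10)  # stand-in for the diffusion policy network

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,              # --policy.optimizer_lr
    betas=(0.95, 0.999),  # --policy.optimizer_betas
    eps=1e-8,             # --policy.optimizer_eps
    weight_decay=1e-6,    # --policy.optimizer_weight_decay
)

lr_scheduler = get_scheduler(
    "cosine",                    # --policy.scheduler_name
    optimizer=optimizer,
    num_warmup_steps=500,        # --policy.scheduler_warmup_steps
    num_training_steps=100_000,  # --steps
)

# In the real loop: compute the diffusion loss, loss.backward(), then step both
for step in range(3):
    optimizer.step()
    lr_scheduler.step()
    print(step, lr_scheduler.get_last_lr())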
Training Monitoring and Debugging
Weights & Biases Integration
# Detailed W&B configuration
lerobot-train \
  --policy.type diffusion \
  --dataset.repo_id your-name/your-dataset \
  --batch_size 64 \
  --steps 100000 \
  --policy.push_to_hub false \
  --wandb.enable true \
  --wandb.project diffusion_policy_experiments \
  --wandb.notes "Diffusion Policy training with long horizon"
  # ... add the remaining training parameters shown in the sections above
Key Metrics Monitoring
Metrics to monitor during training:
- Diffusion Loss: The denoising (noise-prediction) training loss
 - MSE Loss: Mean squared error between predicted and target noise
 - Learning Rate: The current value produced by the learning-rate scheduler
 - Gradient Norm: Global gradient norm, useful for spotting training instability
 - Inference Time: Wall-clock time of a full sampling pass
 - Sample Quality: Qualitative checks of generated action sequences (a minimal manual-logging sketch follows this list)
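These metrics are logged automatically by lerobot-train when W&B is enabled. If you run a custom training loop instead, the same quantities can be computed and logged by hand; the sketch below (with a placeholder model and loss) shows the gradient-norm and loss logging pattern.
# manual_logging_sketch.py
import torch
import wandb

model = torch.nn.Linear(10, 10)  # placeholder for the policy network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

wandb.init(project="diffusion_policy_experiments")

for step in range(100):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # placeholder for the diffusion loss
    optimizer.zero_grad()
    loss.backward()
    # Global gradient norm (clipping is a common companion setting)
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    wandb.log(
        {
            "diffusion_loss": loss.item(),
            "grad_norm": grad_norm.item(),
            "learning_rate": optimizer.param_groups[0]["lr"],
        },
        step=step,
    )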
 
Training Visualization
# visualization.py
import torch
import matplotlib.pyplot as plt
import numpy as np
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
def visualize_diffusion_process(model_path, observation, action_dim=7):
    # Load model (from_pretrained, then move it to the GPU)
    policy = DiffusionPolicy.from_pretrained(model_path)
    policy.to("cuda")
    policy.eval()
    
    # NOTE: this example peeks at internal attributes of the LeRobot implementation
    # (policy.diffusion.unet, policy.diffusion.noise_scheduler, and the conditioning
    # helper). Their names, and the exact observation format they expect (stacked
    # n_obs_steps, normalized inputs, possibly a combined "observation.images" key),
    # depend on the installed LeRobot version -- treat them as assumptions to adapt.
    diffusion = policy.diffusion
    scheduler = diffusion.noise_scheduler
    horizon = policy.config.horizon
    num_inference_steps = policy.config.num_inference_steps or 100
    
    with torch.no_grad():
        # Encode the observation into the global conditioning vector for the U-Net
        global_cond = diffusion._prepare_global_conditioning(observation)
        
        # Reverse process: start from pure Gaussian noise and iteratively denoise
        sample = torch.randn(1, horizon, action_dim, device="cuda")
        scheduler.set_timesteps(num_inference_steps)
        
        actions_sequence = []
        for i, t in enumerate(scheduler.timesteps):
            # Predict the noise component at timestep t, conditioned on the observation
            timestep = torch.full((sample.shape[0],), int(t), device="cuda", dtype=torch.long)
            noise_pred = diffusion.unet(sample, timestep, global_cond=global_cond)
            
            # One reverse-diffusion step
            sample = scheduler.step(noise_pred, t, sample).prev_sample
            
            # Save intermediate results every 10 steps
            if i % 10 == 0:
                actions_sequence.append(sample.cpu().numpy())
        
        final_actions = sample.cpu().numpy()
    
    # Visualize diffusion process
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    
    for i, actions in enumerate(actions_sequence[:6]):
        ax = axes[i//3, i%3]
        ax.plot(actions[0, :, 0], label='Action Dim 0')
        ax.plot(actions[0, :, 1], label='Action Dim 1')
        ax.set_title(f'Diffusion Step {i*10}')
        ax.legend()
    
    plt.tight_layout()
    plt.savefig('diffusion_process.png')
    plt.show()
    
    return final_actions
if __name__ == "__main__":
    model_path = "outputs/train/diffusion_policy_finetuned/checkpoints/last"
    
    # Simulated observation, stacked to (batch, n_obs_steps, ...) as the internal
    # conditioning encoder expects; real inputs must also be normalized as in training
    observation = {
        "observation.images.cam_high": torch.randn(1, 2, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 2, 7, device="cuda")
    }
    
    actions = visualize_diffusion_process(model_path, observation)
    print(f"Generated actions shape: {actions.shape}")
Model Evaluation
Offline Evaluation
# offline_evaluation.py
import torch
import numpy as np
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from lerobot.datasets.lerobot_dataset import LeRobotDataset
def evaluate_diffusion_policy(model_path, dataset_repo_id):
    # Load model
    policy = DiffusionPolicy.from_pretrained(model_path)
    policy.to("cuda")
    policy.eval()
    
    # Load the evaluation dataset. LeRobotDataset is addressed by repo id; hold out
    # evaluation episodes yourself (e.g. via the `episodes` argument).
    dataset = LeRobotDataset(dataset_repo_id)
    
    total_mse_loss = 0.0
    total_mae_loss = 0.0
    num_samples = 0
    
    with torch.no_grad():
        for sample in dataset:
            # Keep tensors only, add a batch dimension, and move to the GPU
            batch = {
                k: v.unsqueeze(0).to("cuda")
                for k, v in sample.items()
                if isinstance(v, torch.Tensor)
            }
            
            # Reset the policy's internal queues so every frame is predicted from
            # scratch, then compare the predicted next action against the recorded one
            policy.reset()
            predicted_action = policy.select_action(batch)
            target_action = batch["action"]
            
            mse_loss = torch.mean((predicted_action - target_action) ** 2)
            mae_loss = torch.mean(torch.abs(predicted_action - target_action))
            
            total_mse_loss += mse_loss.item()
            total_mae_loss += mae_loss.item()
            num_samples += 1
    
    avg_mse_loss = total_mse_loss / num_samples
    avg_mae_loss = total_mae_loss / num_samples
    
    print(f"Average MSE Loss: {avg_mse_loss:.4f}")
    print(f"Average MAE Loss: {avg_mae_loss:.4f}")
    
    return avg_mse_loss, avg_mae_loss
def evaluate_action_diversity(model_path, observation, num_samples=10):
    # Evaluate action diversity: diffusion sampling starts from fresh Gaussian noise,
    # so repeated predictions for the same observation can differ
    policy = DiffusionPolicy.from_pretrained(model_path)
    policy.to("cuda")
    policy.eval()
    
    actions_list = []
    
    with torch.no_grad():
        for _ in range(num_samples):
            policy.reset()  # clear internal queues so each call re-samples a new chunk
            action = policy.select_action(observation)
            actions_list.append(action.cpu().numpy())
    
    actions_array = np.array(actions_list)  # [num_samples, 1, action_dim]
    
    # Calculate action diversity metric: spread of the sampled actions across runs
    action_std = np.std(actions_array, axis=0)  # [1, action_dim]
    avg_std = np.mean(action_std)
    
    print(f"Average action standard deviation: {avg_std:.4f}")
    
    return avg_std, actions_array
if __name__ == "__main__":
    model_path = "outputs/train/diffusion_policy_finetuned/checkpoints/last"
    dataset_repo_id = "your-name/your_dataset"  # held-out evaluation episodes of your data
    
    # Offline evaluation
    evaluate_diffusion_policy(model_path, dataset_repo_id)
    
    # Diversity evaluation
    observation = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }
    
    evaluate_action_diversity(model_path, observation)
Online Evaluation (Robot Environment)
# robot_evaluation.py
import torch
import numpy as np
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
class DiffusionPolicyController:
    def __init__(self, model_path, num_inference_steps=50, device="cuda"):
        self.device = device
        self.policy = DiffusionPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()
        # Fewer reverse-diffusion steps -> faster control. NOTE: this overrides an
        # internal attribute of the LeRobot implementation; its exact location may
        # differ between versions (see lerobot/policies/diffusion/modeling_diffusion.py).
        self.policy.diffusion.num_inference_steps = num_inference_steps
    
    def reset(self):
        # Call at the start of each episode: clears the policy's internal observation
        # history (n_obs_steps) and its receding-horizon action queue (n_action_steps)
        self.policy.reset()
    
    @torch.no_grad()
    def get_action(self, observations):
        # DiffusionPolicy manages observation stacking and receding-horizon execution
        # itself: it returns queued actions and re-plans a new `horizon`-length chunk
        # whenever its queue of n_action_steps actions is exhausted, so the controller
        # only needs to format the current observation
        batch = self.prepare_observation_batch(observations)
        action = self.policy.select_action(batch)  # shape: (1, action_dim)
        return action.squeeze(0).cpu().numpy()
    
    def prepare_observation_batch(self, observations):
        batch = {}
        
        # Process image observations: HWC uint8 -> (1, C, H, W) float in [0, 1]
        if "observation.images.cam_high" in observations:
            image_tensor = self.preprocess_image(observations["observation.images.cam_high"])
            batch["observation.images.cam_high"] = image_tensor.unsqueeze(0).to(self.device)
        
        # Process state observations -> (1, state_dim)
        if "observation.state" in observations:
            state = torch.tensor(observations["observation.state"], dtype=torch.float32)
            batch["observation.state"] = state.unsqueeze(0).to(self.device)
        
        return batch
    
    def preprocess_image(self, image):
        # Channel-first float tensor in [0, 1]; the policy's own input normalization
        # runs inside select_action
        return torch.from_numpy(np.asarray(image)).permute(2, 0, 1).float() / 255.0
# Usage example
if __name__ == "__main__":
    controller = DiffusionPolicyController(
        model_path="outputs/train/diffusion_policy_finetuned/checkpoints/last",
        num_inference_steps=50
    )
    
    # Simulate robot control loop
    controller.reset()  # clear the policy's internal queues at the start of the episode
    for step in range(100):
        # Get current observation
        observations = {
            "observation.images.cam_high": np.random.randint(0, 255, (224, 224, 3)),
            "observation.state": np.random.randn(7)
        }
        
        # Get action
        action = controller.get_action(observations)
        
        # Execute action
        print(f"Step {step}: Action = {action}")
        
        # This should send the action to the actual robot
        # robot.execute_action(action)
Deployment and Optimization
Inference Acceleration
# fast_inference.py
import torch
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy
from diffusers import DDIMScheduler
class FastDiffusionInference:
    def __init__(self, model_path, num_inference_steps=10, device="cuda"):
        self.device = device
        self.policy = DiffusionPolicy.from_pretrained(model_path)
        self.policy.to(device)
        self.policy.eval()
        
        # Swap the DDPM noise scheduler for DDIM so far fewer reverse steps are needed.
        # NOTE: the scheduler is assumed to live at policy.diffusion.noise_scheduler,
        # an internal attribute whose name may differ between LeRobot versions.
        self.policy.diffusion.noise_scheduler = DDIMScheduler.from_config(
            self.policy.diffusion.noise_scheduler.config
        )
        self.policy.diffusion.num_inference_steps = num_inference_steps
        
        # Warmup model
        self.warmup()
    
    def warmup(self):
        # Warmup the model (and CUDA kernels) with dummy data
        dummy_batch = {
            "observation.images.cam_high": torch.randn(1, 3, 224, 224, device=self.device),
            "observation.state": torch.randn(1, 7, device=self.device)
        }
        
        for _ in range(5):
            self.policy.reset()  # force a fresh sampling pass instead of popping the queue
            _ = self.predict(dummy_batch)
    
    @torch.no_grad()
    def predict(self, observations):
        # select_action runs the (shortened) reverse-diffusion sampling whenever its
        # internal action queue is empty, and otherwise returns a queued action
        return self.policy.select_action(observations).cpu().numpy()
if __name__ == "__main__":
    fast_inference = FastDiffusionInference(
        "outputs/train/diffusion_policy_finetuned/checkpoints/last",
        num_inference_steps=10
    )
    
    # Test inference speed
    import time
    
    observations = {
        "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
        "observation.state": torch.randn(1, 7, device="cuda")
    }
    
    start_time = time.time()
    for _ in range(100):
        fast_inference.policy.reset()  # time a full sampling pass rather than a queue pop
        action = fast_inference.predict(observations)
    end_time = time.time()
    
    avg_inference_time = (end_time - start_time) / 100
    print(f"Average inference time: {avg_inference_time:.4f} seconds")
    print(f"Inference frequency: {1/avg_inference_time:.2f} Hz")
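DDIM is not the only fast sampler: the diffusers library also provides DPM-Solver++ style schedulers, which often preserve quality with very few steps. The sketch below swaps one in under the same assumption as the class above, namely that the noise scheduler lives on the internal policy.diffusion.noise_scheduler attribute (the name may vary between LeRobot versions).
# dpm_solver_sketch.py
import torch
from diffusers import DPMSolverMultistepScheduler
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("outputs/train/diffusion_policy_finetuned/checkpoints/last")
policy.to("cuda")
policy.eval()

# Reuse the trained scheduler's configuration so the noise schedule stays consistent
policy.diffusion.noise_scheduler = DPMSolverMultistepScheduler.from_config(
    policy.diffusion.noise_scheduler.config
)
policy.diffusion.num_inference_steps = 10  # DPM-Solver usually needs very few steps

observation = {
    "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
    "observation.state": torch.randn(1, 7, device="cuda"),
}
with torch.no_grad():
    action = policy.select_action(observation)
print(action.shape)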
Best Practices
Data Collection Recommendations
- Smooth Actions: Ensure action sequences in demonstration data are smooth and continuous
 - Diverse Scenarios: Collect data with different starting states and goals
 - High-Quality Annotations: Ensure accuracy of action annotations
 - Sufficient Data Volume: Diffusion models typically require more data
 
Training Optimization Recommendations
- Noise Scheduling: Choose appropriate noise scheduling strategy
 - Inference Steps: Balance quality and speed, choose appropriate inference steps
 - Learning Rate Scheduling: Use cosine annealing or step decay
 - Regularization: Appropriately use weight decay
 
Deployment Optimization Recommendations
- Fast Sampling: Use DDIM or other fast sampling methods
 - Model Compression: Use knowledge distillation or quantization techniques
 - Parallel Inference: Utilize GPU parallel capabilities
 - Cache Optimization: Cache intermediate computation results
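Two of these optimizations are cheap to try on a loaded policy, as sketched below: mixed-precision sampling via autocast, and compiling the denoising network with torch.compile (PyTorch 2.x). The policy.diffusion.unet attribute is the same internal-layout assumption used in the deployment examples above; verify on your own task that reduced precision does not degrade action quality.
# cheap_deployment_wins.py
import torch
from lerobot.policies.diffusion.modeling_diffusion import DiffusionPolicy

policy = DiffusionPolicy.from_pretrained("outputs/train/diffusion_policy_finetuned/checkpoints/last")
policy.to("cuda")
policy.eval()

observation = {
    "observation.images.cam_high": torch.randn(1, 3, 224, 224, device="cuda"),
    "observation.state": torch.randn(1, 7, device="cuda"),
}

# Option 1: mixed-precision sampling
with torch.no_grad(), torch.autocast("cuda", dtype=torch.float16):
    action = policy.select_action(observation)

# Option 2: compile the denoising network once; later calls reuse the optimized graph
policy.diffusion.unet = torch.compile(policy.diffusion.unet)
with torch.no_grad():
    policy.reset()
    action = policy.select_action(observation)
print(action.shape)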
 
Frequently Asked Questions (FAQ)
Q: What advantages does Diffusion Policy have compared to other policy learning methods?
A: Main advantages of Diffusion Policy include:
- Multimodal Generation: Handles tasks that admit multiple valid solutions
 - High-Quality Output: Generates smooth, natural action sequences
 - Robustness: Tolerant of noise and perturbations in the observations
 - Expressiveness: Can model complex action distributions
 
Q: How to choose the appropriate number of inference steps?
A: The choice of inference steps needs to balance quality and speed:
- High quality: up to the full number of training timesteps (e.g. 100 DDPM steps), suitable for offline evaluation
 - Real-time applications: 10-50 steps, suitable for online control
 - Fast prototyping: 5-10 steps, suitable for quick testing
 
Q: How long does training take?
A: Training time depends on multiple factors:
- Dataset size: 500 episodes take approximately 12-24 hours (RTX 3080)
 - Model complexity: Larger models require more time
 - Inference steps: More steps increase training time
 - Convergence requirement: Typically requires 100000-200000 steps
 
Q: How to improve the quality of generated actions?
A: Methods to improve action quality:
- Increase inference steps: More steps typically produce better results
 - Optimize noise scheduling: Choose appropriate noise addition strategy
 - Data quality: Ensure high quality of training data
 - Model architecture: Use larger or deeper networks
 - Regularization techniques: Appropriate regularization prevents overfitting
 
Q: How to handle real-time requirements?
A: Methods to meet real-time requirements:
- Fast sampling: Use DDIM or DPM-Solver
 - Reduce inference steps: Find balance between quality and speed
 - Model distillation: Train smaller student models
 - Parallel inference: Utilize multi-GPU or batching
 - Pre-computation: Pre-compute partial results
 
Related Resources
- Diffusion Policy Original Paper
 - LeRobot Diffusion Policy Implementation
 - Diffusers Library Documentation
 - Diffusion Models Tutorial
 - Robot Learning Course
 
Changelog
- 2024-01: Initial version release
 - 2024-02: Added fast sampling support
 - 2024-03: Optimized memory usage and training efficiency
 - 2024-04: Added diversity evaluation and deployment optimization