ACT training
ACT (Action Chunking with Transformers) comes from the ALOHA line of work. The reference implementation is tonyzhaozh/act; the paper is "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware". Upstream ACT consumes HDF5 episodes.
IO-AI publishes ioaitech/train_act:cuda, a Docker image built on upstream ACT. It includes a CUDA training stack, ACT dependencies, a LeRobot → HDF5 episode conversion layer, and the standard ACT training entrypoint. You can mount a LeRobot dataset directly—no manual HDF5 prep.
When to use
ACT fits imitation learning with clear task boundaries and relatively stable action patterns. It does not rely on a large language model, so the training and inference loop is simpler than Pi0/Pi0.5; the trade-off is that cross-task generalization is usually weaker than with VLA baselines.
Prefer ACT when:
- Data comes from one robot and one task (or a small family of related tasks).
- You want a deployable single-task policy quickly.
- Language is not the main varying factor; task names are mostly for bookkeeping.
- You will evaluate multiple checkpoints on hardware, not only the final loss.
The upstream ACT README notes that real-world performance can keep improving after the loss plateaus, so extend training and compare multiple checkpoints.
Data requirements
The image expects a LeRobot dataset mounted at /data/input, minimally:
your_dataset/
├── meta/
│ └── info.json
├── data/
└── videos/
The converter reads:
- `observation.state` → HDF5 `observations/qpos` (padded to 16 dims with zeros, truncated if longer).
- `action` → HDF5 `action` (same 16-dim rule).
- `observation.images.*` → inferred camera keys, or pass `--camera_keys` explicitly.
If your state is not 16-D, verify padding/truncation matches your downstream controller. The examples target “training runs end-to-end,” not “any robot deploys out of the box.”
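For intuition, here is a minimal sketch of that 16-dim rule; `to_16d` is an illustrative name, not the converter's actual API:

```python
import numpy as np

def to_16d(vec: np.ndarray, dim: int = 16) -> np.ndarray:
    """Illustrative only: zero-pad shorter state/action vectors, truncate longer ones."""
    out = np.zeros(dim, dtype=vec.dtype)
    n = min(vec.shape[-1], dim)
    out[:n] = vec[:n]
    return out

# A 7-DoF arm state becomes [q0..q6, 0, ..., 0]; a 20-D state loses its last 4 dims.
print(to_16d(np.arange(7, dtype=np.float32)))
```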
Training with the image
GPU smoke test:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Smoke test (two episodes)
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name act_smoke \
--task_name smoke_task \
--num_epochs 1 \
--batch_size 8 \
--max_episodes 2
Full training template
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name pick_block_act_v1 \
--task_name pick_block \
--num_epochs 12000 \
--batch_size 64 \
--learning_rate 5e-5 \
--chunk_size 100 \
--kl_weight 10 \
--hidden_dim 512 \
--dim_feedforward 3200 \
--batch_mode fixed_global
Multi-camera example with explicit keys:
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name multi_camera_act \
--task_name pick_block \
--camera_keys observation.images.front,observation.images.left_wrist,observation.images.right_wrist \
--camera_names front,left_wrist,right_wrist \
--num_epochs 12000 \
--batch_size 64
Output layout:
/path/to/output/
├── checkpoints/
│ └── <run_name>/
│ ├── policy_best.ckpt
│ ├── policy_last.ckpt
│ └── dataset_stats.pkl
└── manifest.json
Add --keep_converted_hdf5 to retain intermediate HDF5 for debugging. By default, /data/output/converted_hdf5 is cleaned after training.
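A quick way to sanity-check a finished run from Python; the run name below comes from the template above, and the exact fields of manifest.json are not specified here, so inspect rather than hard-code:

```python
import json
import pickle
from pathlib import Path

out = Path("/path/to/output")
run_dir = out / "checkpoints" / "pick_block_act_v1"  # run name from the template above

# The checkpoints named in the layout above should be present.
print(sorted(p.name for p in run_dir.glob("*.ckpt")))

# dataset_stats.pkl holds the normalization stats saved at train time.
with open(run_dir / "dataset_stats.pkl", "rb") as f:
    print(pickle.load(f).keys())

# manifest.json fields are image-specific; print the file rather than assuming a schema.
print(json.loads((out / "manifest.json").read_text()))
```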
Common flags
| Flag | Default | Meaning |
|---|---|---|
| `--batch_size` | 64 | Reference batch; semantics depend on `--batch_mode` for multi-GPU. |
| `--batch_mode` | fixed_global | Keep the global batch near the single-GPU reference; use `fixed_per_gpu` for throughput tuning. |
| `--num_epochs` | 12000 | Main ACT epochs. |
| `--steps` | 0 | Alias for epochs when `num_epochs` <= 0. |
| `--learning_rate` | 5e-5 | Main LR. |
| `--save_interval` | 6000 | Primarily affects manifest writes; checkpoints follow internal ACT logic. |
| `--policy_class` | ACT | Usually leave default. |
| `--kl_weight` | 10 | Common upstream value. |
| `--chunk_size` | 100 | Action chunk length / ACT query length. |
| `--hidden_dim` | 512 | Transformer width. |
| `--dim_feedforward` | 3200 | FFN width. |
| `--camera_keys` | auto | Comma-separated LeRobot image keys. |
| `--camera_names` | auto | ACT-side camera names; order must match `--camera_keys`. |
| `--max_episodes` | 0 | Convert only the first N episodes; 0 means all. |
| `--convert_workers` | 0 | Parallel conversion workers; lower (e.g. 4 or 8) if RAM is tight. |
| `--num_workers` | 0 | Training DataLoader workers; keep default in containers. |
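To make the two batch modes concrete, here is the arithmetic implied by the table; this mirrors the stated intent, not necessarily the image's exact rounding:

```python
# Assumed semantics for --batch_mode, based on the descriptions above.
n_gpus, reference = 4, 64                 # --batch_size 64 on a 4-GPU node

# fixed_global: keep the global batch near the single-GPU reference.
per_gpu = max(1, reference // n_gpus)     # 16 per GPU -> ~64 global
# fixed_per_gpu: each GPU takes the reference batch; the global batch scales.
global_batch = reference * n_gpus         # 256 global
print(per_gpu, global_batch)
```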
Reproduce from upstream ACT
Upstream ACT expects HDF5, not raw LeRobot folders. Minimal clone and env:
git clone https://github.com/tonyzhaozh/act.git
cd act
conda create -n aloha python=3.8.10
conda activate aloha
pip install torchvision torch pyquaternion pyyaml rospkg pexpect
pip install mujoco==2.3.7 dm_control==1.0.14 opencv-python matplotlib einops packaging h5py ipython
cd detr && pip install -e .
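After the editable install, a short import check catches most environment problems before training:

```python
# Run inside `conda activate aloha` to confirm the core dependencies resolve.
import torch, h5py, mujoco, dm_control, einops  # noqa: F401

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mujoco", mujoco.__version__)
```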
Official training example:
python3 imitate_episodes.py \
--task_name sim_transfer_cube_scripted \
--ckpt_dir /path/to/ckpt \
--policy_class ACT \
--kl_weight 10 \
--chunk_size 100 \
--hidden_dim 512 \
--batch_size 8 \
--dim_feedforward 3200 \
--num_epochs 2000 \
--lr 1e-5 \
--seed 0
To reproduce upstream ACT from platform-exported LeRobot data, you must convert to official HDF5 episodes first. ioaitech/train_act:cuda is the supported path for conversion + training; if you stay purely in the upstream repo, implement an equivalent converter yourself.
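If you do write your own converter, the target is the episode format upstream ACT reads. The sketch below writes one dummy episode with the expected keys; the dims, camera name, and episode length are placeholders to adapt to your robot:

```python
import h5py
import numpy as np

T, state_dim, H, W = 400, 14, 480, 640  # placeholders; match your robot and cameras

with h5py.File("episode_0.hdf5", "w") as f:
    f.attrs["sim"] = False
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((T, state_dim), dtype=np.float32))
    obs.create_dataset("qvel", data=np.zeros((T, state_dim), dtype=np.float32))
    images = obs.create_group("images")
    images.create_dataset("top", data=np.zeros((T, H, W, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((T, state_dim), dtype=np.float32))
```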
Troubleshooting
Missing camera keys
Inspect features in meta/info.json. If auto-inference is wrong, pass --camera_keys and --camera_names with matching counts and order.
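A short inspection snippet; the `features` key reflects recent LeRobot metadata layouts, so treat it as an assumption and adapt if your info.json differs:

```python
import json

with open("/path/to/lerobot_dataset/meta/info.json") as f:
    info = json.load(f)

# Image features are the candidates for --camera_keys.
image_keys = [k for k in info.get("features", {}) if k.startswith("observation.images.")]
print(",".join(image_keys))  # paste into --camera_keys; order --camera_names to match
```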
Multi-GPU hangs / NCCL timeouts
Validate on single GPU with a few episodes first. Keep --num_workers 0 for multi-GPU and lower --convert_workers if conversion stresses disk/RAM.
Loss plateaus but hardware motion is poor
Upstream experience: keep training and compare multiple checkpoints. Evaluate with a fixed real-world harness rather than trusting only policy_last.ckpt.
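Such a harness can be as simple as the loop below; `run_episode` is a hypothetical stand-in for your real-robot rollout and is stubbed here:

```python
from pathlib import Path

def run_episode(ckpt: Path) -> bool:
    # Hypothetical stub: replace with loading `ckpt` and executing one fixed trial.
    return False

def success_rate(ckpt: Path, n_trials: int = 10) -> float:
    # Score every checkpoint with the same trial count and conditions.
    return sum(run_episode(ckpt) for _ in range(n_trials)) / n_trials

ckpts = sorted(Path("/path/to/output/checkpoints/pick_block_act_v1").glob("*.ckpt"))
print({c.name: success_rate(c) for c in ckpts})
```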
You already have official HDF5
Skip LeRobot conversion; call imitate_episodes.py directly or adapt your own driver. That path is outside the default platform wrapper.