ACT training

ACT (Action Chunking Transformer) comes from the ALOHA line of work. The reference implementation is tonyzhaozh/act; the paper is Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware. Upstream ACT consumes HDF5 episodes.

IO-AI publishes ioaitech/train_act:cuda, a Docker image built on upstream ACT. It includes a CUDA training stack, ACT dependencies, a LeRobot → HDF5 episode conversion layer, and the standard ACT training entrypoint. You can mount a LeRobot dataset directly; no manual HDF5 prep is needed.

When to use

ACT fits imitation learning with clear task boundaries and relatively stable action patterns. Because it does not rely on a large language model, the training loop is simpler than Pi0/Pi0.5's; the trade-off is that cross-task generalization is usually weaker than with VLA baselines.

Prefer ACT when:

  • Data comes from one robot and one task (or a small family of related tasks).
  • You want a deployable single-task policy quickly.
  • Language is not the main varying factor; task names are mostly for bookkeeping.
  • You will evaluate multiple checkpoints on hardware, not only the final loss.

The upstream ACT README notes that real-world tasks can keep improving after the loss plateaus, so extend training and compare several checkpoints.

Data requirements

The image expects a LeRobot dataset mounted at /data/input, minimally:

your_dataset/
├── meta/
│   └── info.json
├── data/
└── videos/
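Before launching a container, it can save a round-trip to sanity-check the mounted layout locally. A minimal sketch; the `check_lerobot_layout` helper and its specific checks are illustrative, not part of the image:

```python
import json
from pathlib import Path

def check_lerobot_layout(root):
    """Report problems with the minimal layout expected at /data/input."""
    root = Path(root)
    problems = []
    # The converter needs at least meta/info.json, data/, and videos/.
    for required in ("meta/info.json", "data", "videos"):
        if not (root / required).exists():
            problems.append(f"missing {required}")
    info_path = root / "meta" / "info.json"
    if info_path.exists():
        info = json.loads(info_path.read_text())
        # Camera-key inference reads the features map from info.json.
        if "features" not in info:
            problems.append("meta/info.json has no 'features' entry")
    return problems
```

An empty list means the minimal structure is in place; anything else is worth fixing before mounting the dataset at /data/input.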

The converter reads:

  • observation.state → HDF5 observations/qpos (pad to 16 dims with zeros, truncate if longer).
  • action → HDF5 action (same 16-dim rule).
  • observation.images.* → inferred camera keys, or pass --camera_keys explicitly.

If your state is not 16-D, verify that the padding/truncation rule matches your downstream controller. The examples here aim to make training run end-to-end, not to make any robot deploy out of the box.
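The 16-dim rule can be mirrored locally to see exactly what the converter will feed ACT for a given state or action vector. A sketch of the pad/truncate behavior as described above; `to_16_dims` is a hypothetical helper, not the converter's actual code:

```python
import numpy as np

TARGET_DIM = 16  # fixed qpos/action width used by the converter

def to_16_dims(vec):
    """Pad the last axis with zeros up to 16 dims, or truncate if longer,
    mirroring the stated rule for observation.state and action."""
    vec = np.asarray(vec, dtype=np.float32)
    if vec.shape[-1] >= TARGET_DIM:
        return vec[..., :TARGET_DIM]
    pad = TARGET_DIM - vec.shape[-1]
    widths = [(0, 0)] * (vec.ndim - 1) + [(0, pad)]
    return np.pad(vec, widths)
```

For a 7-DoF arm this appends nine zero columns; for a 20-D state it silently drops the last four, which is exactly the case to double-check against your controller.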

Training with the image

GPU smoke test:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Smoke test (two episodes)

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name act_smoke \
--task_name smoke_task \
--num_epochs 1 \
--batch_size 8 \
--max_episodes 2

Full training template

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name pick_block_act_v1 \
--task_name pick_block \
--num_epochs 12000 \
--batch_size 64 \
--learning_rate 5e-5 \
--chunk_size 100 \
--kl_weight 10 \
--hidden_dim 512 \
--dim_feedforward 3200 \
--batch_mode fixed_global

Multi-camera example with explicit keys:

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name multi_camera_act \
--task_name pick_block \
--camera_keys observation.images.front,observation.images.left_wrist,observation.images.right_wrist \
--camera_names front,left_wrist,right_wrist \
--num_epochs 12000 \
--batch_size 64
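When auto-inference picks the wrong cameras, the candidate keys can be listed straight from meta/info.json before writing the flags by hand. A sketch, assuming the dataset's image features live under observation.images.* in the features map:

```python
import json
from pathlib import Path

def infer_camera_keys(info_json_path):
    """List LeRobot image features (observation.images.*) from meta/info.json,
    in a stable sorted order, plus matching short names for the ACT side."""
    info = json.loads(Path(info_json_path).read_text())
    keys = sorted(k for k in info.get("features", {})
                  if k.startswith("observation.images."))
    # ACT-side names: the last dotted component, e.g. "front", "left_wrist".
    names = [k.rsplit(".", 1)[-1] for k in keys]
    return keys, names
```

`",".join(keys)` then becomes the value for --camera_keys and `",".join(names)` the value for --camera_names, which keeps the counts and order aligned by construction.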

Output layout:

/path/to/output/
├── checkpoints/
│   └── <run_name>/
│       ├── policy_best.ckpt
│       ├── policy_last.ckpt
│       └── dataset_stats.pkl
└── manifest.json

Add --keep_converted_hdf5 to retain intermediate HDF5 for debugging. By default, /data/output/converted_hdf5 is cleaned after training.
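With --keep_converted_hdf5 set, each retained episode can be inspected to confirm the converter wrote what you expect. A sketch using h5py; only observations/qpos and action follow from the description above, and image dataset names depend on your cameras:

```python
import h5py

def summarize_episode(path):
    """Map every dataset name in one converted HDF5 episode to its shape."""
    summary = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            # Record leaf datasets only; groups carry no shape.
            if isinstance(obj, h5py.Dataset):
                summary[name] = obj.shape
        f.visititems(visit)
    return summary
```

Expect (T, 16) for observations/qpos and action after the padding rule, where T is the episode length.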

Common flags

  • --batch_size (default 64): reference batch size; semantics depend on --batch_mode for multi-GPU.
  • --batch_mode (default fixed_global): keep the global batch near the single-GPU reference; use fixed_per_gpu for throughput tuning.
  • --num_epochs (default 12000): main ACT epochs.
  • --steps (default 0): alias for epochs, used when num_epochs <= 0.
  • --learning_rate (default 5e-5): main learning rate.
  • --save_interval (default 6000): primarily affects manifest writes; checkpoints follow internal ACT logic.
  • --policy_class (default ACT): usually leave at the default.
  • --kl_weight (default 10): common upstream value.
  • --chunk_size (default 100): action chunk length / ACT query length.
  • --hidden_dim (default 512): transformer width.
  • --dim_feedforward (default 3200): FFN width.
  • --camera_keys (default auto): comma-separated LeRobot image keys.
  • --camera_names (default auto): ACT-side camera names; order must match --camera_keys.
  • --max_episodes (default 0): convert only the first N episodes; 0 means all.
  • --convert_workers (default 0): parallel conversion workers; reduce (e.g. to 4 or 8) if RAM is tight.
  • --num_workers (default 0): training DataLoader workers; keep the default in containers.

Reproduce from upstream ACT

Upstream ACT expects HDF5, not raw LeRobot folders. Minimal clone and env:

git clone https://github.com/tonyzhaozh/act.git
cd act

conda create -n aloha python=3.8.10
conda activate aloha
pip install torchvision torch pyquaternion pyyaml rospkg pexpect
pip install mujoco==2.3.7 dm_control==1.0.14 opencv-python matplotlib einops packaging h5py ipython
cd detr && pip install -e .

Official training example:

python3 imitate_episodes.py \
--task_name sim_transfer_cube_scripted \
--ckpt_dir /path/to/ckpt \
--policy_class ACT \
--kl_weight 10 \
--chunk_size 100 \
--hidden_dim 512 \
--batch_size 8 \
--dim_feedforward 3200 \
--num_epochs 2000 \
--lr 1e-5 \
--seed 0

To reproduce upstream ACT from platform-exported LeRobot data, you must convert to official HDF5 episodes first. ioaitech/train_act:cuda is the supported path for conversion + training; if you stay purely in the upstream repo, implement an equivalent converter yourself.
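If you do stay in the upstream repo, such a converter needs to emit one HDF5 file per episode in the layout upstream ACT reads. A minimal sketch of the writer side, assuming (T, D) qpos/action arrays and per-camera (T, H, W, 3) uint8 frames; verify the exact keys and attributes against the upstream data-loading code before relying on it:

```python
import h5py
import numpy as np

def write_act_episode(path, qpos, action, images, sim=False):
    """Write one episode in an upstream-ACT-style HDF5 layout:
    observations/qpos, action, and per-camera uint8 frames under
    observations/images/<name>. A sketch only, not the image's converter."""
    with h5py.File(path, "w") as f:
        f.attrs["sim"] = sim  # upstream distinguishes sim vs real episodes
        obs = f.create_group("observations")
        obs.create_dataset("qpos", data=np.asarray(qpos, dtype=np.float32))
        f.create_dataset("action", data=np.asarray(action, dtype=np.float32))
        img_grp = obs.create_group("images")
        for name, frames in images.items():
            frames = np.asarray(frames, dtype=np.uint8)
            # Chunk per frame so single timesteps load without full decode.
            img_grp.create_dataset(name, data=frames,
                                   chunks=(1,) + frames.shape[1:])
    return path
```

The reader side is then just imitate_episodes.py pointed at a directory of such files, with --task_name matching however your task registry names the episode set.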

Troubleshooting

Missing camera keys

Inspect features in meta/info.json. If auto-inference is wrong, pass --camera_keys and --camera_names with matching counts and order.

Multi-GPU hangs / NCCL timeouts

Validate on single GPU with a few episodes first. Keep --num_workers 0 for multi-GPU and lower --convert_workers if conversion stresses disk/RAM.

Loss plateaus but hardware motion is poor

Upstream experience: keep training and compare multiple checkpoints. Evaluate with a fixed real-world eval harness, not only policy_last.ckpt.

You already have official HDF5

Skip LeRobot conversion; call imitate_episodes.py directly or adapt your own driver. That path is outside the default platform wrapper.

References