ACT training
ACT (Action Chunking with Transformers) comes from the ALOHA line of work. The reference implementation is tonyzhaozh/act; the paper is "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware". Upstream ACT consumes HDF5 episodes.
IO-AI publishes ioaitech/train_act:cuda, a Docker image built on upstream ACT. It includes a CUDA training stack, ACT dependencies, a LeRobot → HDF5 episode conversion layer, and the standard ACT training entrypoint. You can mount a LeRobot dataset directly—no manual HDF5 prep.
When to use
ACT fits imitation learning with clear task boundaries and relatively stable action patterns. It does not rely on a large language model, so the training and inference loop is simpler than Pi0/Pi0.5; the trade-off is that cross-task generalization is usually weaker than with VLA baselines.
Prefer ACT when:
- Data comes from one robot and one task (or a small family of related tasks).
- You want a deployable single-task policy quickly.
- Language is not the main varying factor; task names are mostly for bookkeeping.
- You will evaluate multiple checkpoints on hardware, not only the final loss.
The upstream ACT README notes that real-world performance can keep improving after the loss plateaus, so extend training and compare multiple checkpoints.
Data requirements
The image expects a LeRobot dataset mounted at /data/input, minimally:
your_dataset/
├── meta/
│ └── info.json
├── data/
└── videos/
The converter reads:
- `observation.state` → HDF5 `observations/qpos` (padded to 16 dims with zeros, truncated if longer).
- `action` → HDF5 `action` (same 16-dim rule).
- `observation.images.*` → inferred camera keys, or pass `--camera_keys` explicitly.
If your state is not 16-D, verify padding/truncation matches your downstream controller. The examples target “training runs end-to-end,” not “any robot deploys out of the box.”
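For intuition, here is a minimal sketch of that 16-dim rule; `to_16d` is an illustrative name, not the converter's actual API:

```python
import numpy as np

def to_16d(vec: np.ndarray, dim: int = 16) -> np.ndarray:
    """Illustrative only: zero-pad shorter state/action vectors, truncate longer ones."""
    out = np.zeros(dim, dtype=vec.dtype)
    n = min(vec.shape[-1], dim)
    out[:n] = vec[:n]
    return out

# A 7-DoF arm state becomes [q0..q6, 0, ..., 0]; a 20-D state loses its last 4 dims.
print(to_16d(np.arange(7, dtype=np.float32)))
```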
Training with the image
GPU smoke test:
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
Smoke test (two episodes)
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name act_smoke \
--task_name smoke_task \
--num_epochs 1 \
--batch_size 8 \
--max_episodes 2
Full training template
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name pick_block_act_v1 \
--task_name pick_block \
--num_epochs 12000 \
--batch_size 64 \
--learning_rate 5e-5 \
--chunk_size 100 \
--kl_weight 10 \
--hidden_dim 512 \
--dim_feedforward 3200 \
--batch_mode fixed_global
Multi-camera example with explicit keys:
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name multi_camera_act \
--task_name pick_block \
--camera_keys observation.images.front,observation.images.left_wrist,observation.images.right_wrist \
--camera_names front,left_wrist,right_wrist \
--num_epochs 12000 \
--batch_size 64
Output layout:
/path/to/output/
├── checkpoints/
│ └── <run_name>/
│ ├── policy_best.ckpt
│ ├── policy_last.ckpt
│ └── dataset_stats.pkl
└── manifest.json
Add --keep_converted_hdf5 to retain intermediate HDF5 for debugging. By default, /data/output/converted_hdf5 is cleaned after training.
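A quick way to sanity-check a finished run from Python; the run name below comes from the template above, and the exact fields of manifest.json are not specified here, so inspect rather than hard-code:

```python
import json
import pickle
from pathlib import Path

out = Path("/path/to/output")
run_dir = out / "checkpoints" / "pick_block_act_v1"  # run name from the template above

# The checkpoints named in the layout above should be present.
print(sorted(p.name for p in run_dir.glob("*.ckpt")))

# dataset_stats.pkl holds the normalization stats saved at train time.
with open(run_dir / "dataset_stats.pkl", "rb") as f:
    print(pickle.load(f).keys())

# manifest.json fields are image-specific; print the file rather than assuming a schema.
print(json.loads((out / "manifest.json").read_text()))
```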
Common flags
| Flag | Default | Meaning |
|---|---|---|
| `--batch_size` | 64 | Reference batch; semantics depend on `--batch_mode` for multi-GPU. |
| `--batch_mode` | fixed_global | Keep the global batch near the single-GPU reference; use `fixed_per_gpu` for throughput tuning. |
| `--num_epochs` | 12000 | Main ACT epochs. |
| `--steps` | 0 | Alias for epochs when `num_epochs` <= 0. |
| `--learning_rate` | 5e-5 | Main LR. |
| `--save_interval` | 6000 | Primarily affects manifest writes; checkpoints follow internal ACT logic. |
| `--policy_class` | ACT | Usually leave default. |
| `--kl_weight` | 10 | Common upstream value. |
| `--chunk_size` | 100 | Action chunk length / ACT query length. |
| `--hidden_dim` | 512 | Transformer width. |
| `--dim_feedforward` | 3200 | FFN width. |
| `--camera_keys` | auto | Comma-separated LeRobot image keys. |
| `--camera_names` | auto | ACT-side camera names; order must match `--camera_keys`. |
| `--max_episodes` | 0 | Convert only the first N episodes; 0 means all. |
| `--convert_workers` | 0 | Parallel conversion workers; lower (e.g. 4 or 8) if RAM is tight. |
| `--num_workers` | 0 | Training DataLoader workers; keep default in containers. |
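To make the two batch modes concrete, here is the arithmetic implied by the table; this mirrors the stated intent, not necessarily the image's exact rounding:

```python
# Assumed semantics for --batch_mode, based on the descriptions above.
n_gpus, reference = 4, 64                 # --batch_size 64 on a 4-GPU node

# fixed_global: keep the global batch near the single-GPU reference.
per_gpu = max(1, reference // n_gpus)     # 16 per GPU -> ~64 global
# fixed_per_gpu: each GPU takes the reference batch; the global batch scales.
global_batch = reference * n_gpus         # 256 global
print(per_gpu, global_batch)
```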
Reproduce from upstream ACT
Upstream ACT expects HDF5, not raw LeRobot folders. Minimal clone and env:
git clone https://github.com/tonyzhaozh/act.git
cd act
conda create -n aloha python=3.8.10
conda activate aloha
pip install torchvision torch pyquaternion pyyaml rospkg pexpect
pip install mujoco==2.3.7 dm_control==1.0.14 opencv-python matplotlib einops packaging h5py ipython
cd detr && pip install -e .
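After the editable install, a short import check catches most environment problems before training:

```python
# Run inside `conda activate aloha` to confirm the core dependencies resolve.
import torch, h5py, mujoco, dm_control, einops  # noqa: F401

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("mujoco", mujoco.__version__)
```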
Official training example:
python3 imitate_episodes.py \
--task_name sim_transfer_cube_scripted \
--ckpt_dir /path/to/ckpt \
--policy_class ACT \
--kl_weight 10 \
--chunk_size 100 \
--hidden_dim 512 \
--batch_size 8 \
--dim_feedforward 3200 \
--num_epochs 2000 \
--lr 1e-5 \
--seed 0
To reproduce upstream ACT from platform-exported LeRobot data, you must convert to official HDF5 episodes first. ioaitech/train_act:cuda is the supported path for conversion + training; if you stay purely in the upstream repo, implement an equivalent converter yourself.
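If you do write your own converter, the target is the episode format upstream ACT reads. The sketch below writes one dummy episode with the expected keys; the dims, camera name, and episode length are placeholders to adapt to your robot:

```python
import h5py
import numpy as np

T, state_dim, H, W = 400, 14, 480, 640  # placeholders; match your robot and cameras

with h5py.File("episode_0.hdf5", "w") as f:
    f.attrs["sim"] = False
    obs = f.create_group("observations")
    obs.create_dataset("qpos", data=np.zeros((T, state_dim), dtype=np.float32))
    obs.create_dataset("qvel", data=np.zeros((T, state_dim), dtype=np.float32))
    images = obs.create_group("images")
    images.create_dataset("top", data=np.zeros((T, H, W, 3), dtype=np.uint8))
    f.create_dataset("action", data=np.zeros((T, state_dim), dtype=np.float32))
```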
Troubleshooting
Missing camera keys
Inspect features in meta/info.json. If auto-inference is wrong, pass --camera_keys and --camera_names with matching counts and order.
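A short inspection snippet; the `features` key reflects recent LeRobot metadata layouts, so treat it as an assumption and adapt if your info.json differs:

```python
import json

with open("/path/to/lerobot_dataset/meta/info.json") as f:
    info = json.load(f)

# Image features are the candidates for --camera_keys.
image_keys = [k for k in info.get("features", {}) if k.startswith("observation.images.")]
print(",".join(image_keys))  # paste into --camera_keys; order --camera_names to match
```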
Multi-GPU hangs / NCCL timeouts
Validate on single GPU with a few episodes first. Keep --num_workers 0 for multi-GPU and lower --convert_workers if conversion stresses disk/RAM.
Loss plateaus but hardware motion is poor
Upstream experience: keep training and compare multiple checkpoints. Evaluate with a fixed real-world harness rather than trusting only policy_last.ckpt.
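Such a harness can be as simple as the loop below; `run_episode` is a hypothetical stand-in for your real-robot rollout and is stubbed here:

```python
from pathlib import Path

def run_episode(ckpt: Path) -> bool:
    # Hypothetical stub: replace with loading `ckpt` and executing one fixed trial.
    return False

def success_rate(ckpt: Path, n_trials: int = 10) -> float:
    # Score every checkpoint with the same trial count and conditions.
    return sum(run_episode(ckpt) for _ in range(n_trials)) / n_trials

ckpts = sorted(Path("/path/to/output/checkpoints/pick_block_act_v1").glob("*.ckpt"))
print({c.name: success_rate(c) for c in ckpts})
```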
You already have official HDF5
Skip LeRobot conversion; call imitate_episodes.py directly or adapt your own driver. That path is outside the default platform wrapper.