ACT model training guide

This guide explains how to train ACT with the Docker image ioaitech/train_act:cuda published by IO-AI.TECH. Mount paths, argument names, and defaults match the training scripts inside the image.

Docker image registry

Images are published on Docker Hub under the ioaitech organization (e.g. ioaitech/train_act:cuda).

When to use ACT

ACT fits imitation-learning setups where the task boundary is clear and the action pattern is relatively stable. If your first goal is to get a single-task training loop working reliably and then tune hyperparameters, ACT remains a practical choice.

This guide assumes you have already labeled data on the EmbodiFlow platform and exported it in LeRobot format.

One-command training

Prerequisites

  • Linux host
  • Working NVIDIA driver
  • Docker installed
  • docker run --gpus all can see the GPU

Quick check:

docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

Minimal run

Mount your local LeRobot dataset at /data/input and outputs at /data/output:

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name act_demo \
--task_name demo_task \
--num_epochs 1 \
--batch_size 8

Use this to verify mounts, format detection, and the training path end to end. After that, increase epochs and adjust hyperparameters for a real run.

This template is closer to a typical production-style run:

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name lemon_act_v1 \
--task_name pick_lemon \
--num_epochs 12000 \
--batch_size 64 \
--learning_rate 5e-5 \
--chunk_size 100 \
--kl_weight 10 \
--hidden_dim 512 \
--dim_feedforward 3200 \
--batch_mode fixed_global \
--gpus all

To pin specific GPUs, change the last line to e.g. --gpus 0 or --gpus 0,1.

Data requirements

The container checks for /data/input/meta/info.json at startup; if it is missing, the run exits immediately. Your dataset root should look like:

your_dataset/
├── meta/
│   └── info.json
├── data/
└── videos/

The training entrypoint supports LeRobot v2 and v3 datasets. It detects the version and applies compatibility handling and intermediate conversion when needed.
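As a sketch, you can mirror the container's startup check before launching. The snippet below fabricates a throwaway dataset skeleton so it runs as-is; point DATASET at your real dataset root instead. Reading codebase_version from meta/info.json is an assumption about the LeRobot metadata layout, so treat it as illustrative:

```shell
# Sketch of a pre-flight check; point DATASET at your real dataset root.
# A throwaway skeleton is created here so the snippet runs as-is.
DATASET=$(mktemp -d)
mkdir -p "$DATASET/meta" "$DATASET/data" "$DATASET/videos"
printf '{"codebase_version": "v2.1"}' > "$DATASET/meta/info.json"

# The check itself mirrors what the container does at startup:
if [ -f "$DATASET/meta/info.json" ]; then
  VERSION=$(python3 -c "import json; print(json.load(open('$DATASET/meta/info.json'))['codebase_version'])")
  echo "layout ok, LeRobot $VERSION"
else
  echo "meta/info.json missing: the run will exit immediately"
fi
```

Running this against the real export before docker run saves a failed container start when the mount path or layout is wrong.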

Camera fields

If image feature names follow common conventions, camera keys are inferred from meta/info.json. For unusual naming, pass them explicitly:

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name multi_cam_exp \
--task_name tron2_task \
--camera_keys observation.images.cam_high,observation.images.cam_right_wrist,observation.images.cam_left_wrist \
--camera_names cam_high,cam_right_wrist,cam_left_wrist
  • --camera_keys: LeRobot image feature keys
  • --camera_names: ACT-side camera names

The two lists must stay aligned in order.

Common arguments

The following tables match the argument definitions in train_lerobot_to_act.py inside the image.

Core training

Argument             Default       Description
--batch_size         64            Training batch size
--num_epochs         12000         Number of training epochs
--steps              0             Alias for num_epochs; used only when num_epochs=0
--learning_rate      5e-5          Main learning rate
--save_interval      6000          Checkpoint save interval (epochs)
--gpus               all           All GPUs, or a list like 0,1
--batch_mode         fixed_global  Multi-GPU global batch semantics closer to single-GPU reference
--num_workers        0             Recommended in containers to reduce /dev/shm pressure

ACT model

Argument             Default  Description
--task_name          auto     Infer primary task from dataset; falls back on failure
--run_name           auto     Checkpoint subdirectory name
--policy_class       ACT      Usually leave default
--kl_weight          10       KL loss weight
--chunk_size         100      Action chunk length
--hidden_dim         512      Transformer hidden size
--dim_feedforward    3200     FFN hidden size
--seed               42       Random seed

Data bridge

Argument               Default   Description
--camera_keys          inferred  LeRobot image keys
--camera_names         derived   ACT camera names
--episode_len          0         Force episode length; 0 = auto
--idle_threshold       1e-4      Idle-frame filter threshold
--max_episodes         0         Convert only first N episodes (smoke test)
--convert_workers      0         Parallel workers for conversion
--keep_converted_hdf5  off       Keep intermediate HDF5 for debugging

Outputs

Artifacts are written under the mounted /data/output:

/path/to/output/
├── checkpoints/
│   └── <run_name>/
│       ├── policy_best.ckpt
│       ├── policy_last.ckpt
│       └── dataset_stats.pkl
└── manifest.json
  • policy_best.ckpt: best checkpoint during training
  • policy_last.ckpt: last saved checkpoint
  • dataset_stats.pkl: statistics used for training
  • manifest.json: run metadata
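A quick shell check can confirm all four artifacts landed where expected. The snippet below fabricates the output tree so it runs as-is; point OUT and RUN_NAME at your real output directory and run name instead:

```shell
# Post-run artifact check; point OUT and RUN_NAME at your real run.
# A fake tree is created here so the snippet runs as-is.
OUT=$(mktemp -d)
RUN_NAME=act_demo
mkdir -p "$OUT/checkpoints/$RUN_NAME"
touch "$OUT/checkpoints/$RUN_NAME/policy_best.ckpt" \
      "$OUT/checkpoints/$RUN_NAME/policy_last.ckpt" \
      "$OUT/checkpoints/$RUN_NAME/dataset_stats.pkl"
printf '{}' > "$OUT/manifest.json"

MISSING=0
for f in policy_best.ckpt policy_last.ckpt dataset_stats.pkl; do
  [ -f "$OUT/checkpoints/$RUN_NAME/$f" ] || { echo "missing: $f"; MISSING=1; }
done
[ -f "$OUT/manifest.json" ] || { echo "missing: manifest.json"; MISSING=1; }
[ "$MISSING" -eq 0 ] && echo "all artifacts present"
```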

Multi-GPU notes

Multi-GPU launch is handled inside the container; you normally do not need to hand-roll torchrun. Suggested practice:

  • Prefer --batch_mode fixed_global for easier comparison with single-GPU runs
  • Keep --num_workers 0 in containers
  • For a first multi-GPU test, add --max_episodes 10

Switch to --batch_mode fixed_per_gpu only when you intentionally prioritize throughput.
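Putting those suggestions together, a first multi-GPU dry run might look like the template below; the run name, task name, and host paths are placeholders:

```shell
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
--run_name multi_gpu_smoke \
--task_name demo_task \
--gpus 0,1 \
--batch_mode fixed_global \
--num_workers 0 \
--max_episodes 10 \
--num_epochs 1 \
--batch_size 8
```

Once this completes cleanly, remove --max_episodes and raise --num_epochs and --batch_size for the real run.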

FAQ

1. Container says dataset not found

Check:

  • Host path is mounted to /data/input
  • /data/input/meta/info.json exists

Missing info.json usually means a wrong path or directory layout, not a bug in the trainer.

2. DataLoader errors or NCCL timeouts on multi-GPU

Try:

  • Keep --num_workers 0
  • Lower --convert_workers to 2 or 4
  • Shorten the pipeline with --max_episodes for a dry run

3. Choosing task_name

If task metadata is complete, --task_name auto is usually enough. For complex task definitions, set an explicit name to keep outputs and logs organized.

4. First run feels slow

ResNet18 weights are baked into the image, but conversion and the first data pass still take time. Steady log progress is expected.

Practical tips

  • Validate the pipeline with 1 epoch or a small --max_episodes before a long run
  • Keep a fixed baseline config and change one knob at a time
  • Compare several checkpoints, not only the last one
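One way to follow the fixed-baseline tip is to keep the shared flags in a single shell variable and override only the knob under test per run. The variable name and values below are illustrative, not part of the image:

```shell
# Shared baseline; vary one knob per experiment run.
BASELINE="--task_name pick_lemon --num_epochs 12000 --batch_size 64 \
--learning_rate 5e-5 --kl_weight 10 --hidden_dim 512 --dim_feedforward 3200"

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input \
-v /path/to/output:/data/output \
ioaitech/train_act:cuda \
$BASELINE --run_name lemon_act_chunk50 --chunk_size 50
```

Giving each variant its own --run_name keeps checkpoints from different experiments in separate subdirectories under /data/output/checkpoints.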

References