# ACT model training guide
This guide explains how to train ACT with the Docker image ioaitech/train_act:cuda published by IO-AI.TECH. Mount paths, argument names, and defaults match the training scripts inside the image.
Images are published on Docker Hub under the ioaitech organization (e.g. ioaitech/train_act:cuda).
## When to use ACT
ACT fits imitation-learning setups where the task boundary is clear and the action pattern is relatively stable. If your first goal is to get a single-task training loop working reliably and then tune hyperparameters, ACT remains a practical choice.
This guide assumes you have already labeled data on the EmbodiFlow platform and exported it in LeRobot format.
## One-command training
### Prerequisites
- Linux host
- Working NVIDIA driver
- Docker installed
- `docker run --gpus all` can see the GPU
Quick check:

```shell
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
```
### Minimal run
Mount your local LeRobot dataset at `/data/input` and outputs at `/data/output`:
```shell
docker run --rm --gpus all \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/data/output \
  ioaitech/train_act:cuda \
  --run_name act_demo \
  --task_name demo_task \
  --num_epochs 1 \
  --batch_size 8
```
Use this to verify mounts, format detection, and the training path end to end. After that, increase epochs and adjust hyperparameters for a real run.
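Before a long run, it can also help to verify the mount source on the host. The sketch below is a hypothetical host-side helper, not part of the image; it checks the one file the container requires at startup:

```shell
# check_dataset DIR — verify the layout the container expects at /data/input.
# Host-side sketch only; the container's own startup check is authoritative.
check_dataset() {
  if [ -f "$1/meta/info.json" ]; then
    echo "ok: meta/info.json found"
  else
    echo "missing: $1/meta/info.json" >&2
    return 1
  fi
}

# Usage: check_dataset /path/to/lerobot_dataset
```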
### Recommended one-shot script
This template is closer to a typical production-style run:
```shell
docker run --rm --gpus all \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/data/output \
  ioaitech/train_act:cuda \
  --run_name lemon_act_v1 \
  --task_name pick_lemon \
  --num_epochs 12000 \
  --batch_size 64 \
  --learning_rate 5e-5 \
  --chunk_size 100 \
  --kl_weight 10 \
  --hidden_dim 512 \
  --dim_feedforward 3200 \
  --batch_mode fixed_global \
  --gpus all
```
To pin specific GPUs, change the last line to e.g. `--gpus 0` or `--gpus 0,1`.
## Data requirements
The container checks for `/data/input/meta/info.json` at startup; if it is missing, the run exits immediately. Your dataset root should look like:
```
your_dataset/
├── meta/
│   └── info.json
├── data/
└── videos/
```
The training entrypoint supports LeRobot v2 and v3 datasets. It detects the version and applies compatibility handling and intermediate conversion when needed.
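To see which version the entrypoint will detect, you can peek at the dataset metadata yourself. The sketch below assumes the standard LeRobot `codebase_version` field in `meta/info.json` and that the key and value sit on one line of the JSON:

```shell
# dataset_version DIR — print the "codebase_version" recorded in meta/info.json.
# Convenience sketch only; a JSON-aware tool like jq is more robust.
dataset_version() {
  sed -n 's/.*"codebase_version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    "$1/meta/info.json"
}

# Usage: dataset_version /path/to/lerobot_dataset
```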
### Camera fields
If image feature names follow common conventions, camera keys are inferred from `meta/info.json`. For unusual naming, pass them explicitly:
```shell
docker run --rm --gpus all \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/data/output \
  ioaitech/train_act:cuda \
  --run_name multi_cam_exp \
  --task_name tron2_task \
  --camera_keys observation.images.cam_high,observation.images.cam_right_wrist,observation.images.cam_left_wrist \
  --camera_names cam_high,cam_right_wrist,cam_left_wrist
```
- `--camera_keys`: LeRobot image feature keys
- `--camera_names`: ACT-side camera names
The two lists must stay aligned in order.
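A mismatch between the two lists is easy to miss in a long command line. The helper below is a hypothetical host-side check (not part of the image) that compares the entry counts of two comma-separated lists:

```shell
# check_aligned KEYS NAMES — compare entry counts of two comma-separated lists.
# Host-side sketch; the trainer's own validation is authoritative.
check_aligned() {
  n1=$(printf '%s\n' "$1" | awk -F',' '{print NF}')
  n2=$(printf '%s\n' "$2" | awk -F',' '{print NF}')
  if [ "$n1" -eq "$n2" ]; then
    echo "aligned ($n1 entries)"
  else
    echo "mismatch: $n1 keys vs $n2 names" >&2
    return 1
  fi
}

check_aligned \
  observation.images.cam_high,observation.images.cam_left_wrist \
  cam_high,cam_left_wrist   # -> aligned (2 entries)
```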
## Common arguments
The following tables match the argument definitions in `train_lerobot_to_act.py` inside the image.
### Core training
| Argument | Default | Description |
|---|---|---|
| `--batch_size` | 64 | Training batch size |
| `--num_epochs` | 12000 | Number of training epochs |
| `--steps` | 0 | Alias for `--num_epochs`; used only when `--num_epochs` is 0 |
| `--learning_rate` | 5e-5 | Main learning rate |
| `--save_interval` | 6000 | Checkpoint save interval (epochs) |
| `--gpus` | all | All GPUs, or a comma-separated list like `0,1` |
| `--batch_mode` | fixed_global | Keeps global batch semantics closer to a single-GPU reference run |
| `--num_workers` | 0 | DataLoader workers; 0 is recommended in containers to reduce `/dev/shm` pressure |
### ACT model
| Argument | Default | Description |
|---|---|---|
| `--task_name` | auto | Infer the primary task from the dataset; falls back on failure |
| `--run_name` | auto | Checkpoint subdirectory name |
| `--policy_class` | ACT | Usually leave at the default |
| `--kl_weight` | 10 | KL loss weight |
| `--chunk_size` | 100 | Action chunk length |
| `--hidden_dim` | 512 | Transformer hidden size |
| `--dim_feedforward` | 3200 | FFN hidden size |
| `--seed` | 42 | Random seed |
### Data bridge
| Argument | Default | Description |
|---|---|---|
| `--camera_keys` | inferred | LeRobot image keys |
| `--camera_names` | derived | ACT camera names |
| `--episode_len` | 0 | Force episode length; 0 = auto |
| `--idle_threshold` | 1e-4 | Idle-frame filter threshold |
| `--max_episodes` | 0 | Convert only the first N episodes (smoke test) |
| `--convert_workers` | 0 | Parallel workers for conversion |
| `--keep_converted_hdf5` | off | Keep intermediate HDF5 for debugging |
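Combining the bridge flags above, a short smoke-test run might look like the sketch below. Paths and names are placeholders, and it assumes `--keep_converted_hdf5` is a boolean switch:

```shell
docker run --rm --gpus all \
  -v /path/to/lerobot_dataset:/data/input \
  -v /path/to/output:/data/output \
  ioaitech/train_act:cuda \
  --run_name bridge_smoke \
  --task_name demo_task \
  --num_epochs 1 \
  --max_episodes 5 \
  --keep_converted_hdf5
```

Keeping the intermediate HDF5 here lets you inspect what the conversion produced before committing to a full run.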
## Outputs
Artifacts are written under the mounted `/data/output`:
```
/path/to/output/
├── checkpoints/
│   └── <run_name>/
│       ├── policy_best.ckpt
│       ├── policy_last.ckpt
│       └── dataset_stats.pkl
└── manifest.json
```
- `policy_best.ckpt`: best checkpoint during training
- `policy_last.ckpt`: last saved checkpoint
- `dataset_stats.pkl`: statistics used for training
- `manifest.json`: run metadata
## Multi-GPU notes
Multi-GPU launch is handled inside the container; you normally do not need to hand-roll torchrun. Suggested practice:
- Prefer `--batch_mode fixed_global` for easier comparison with single-GPU runs
- Keep `--num_workers 0` in containers
- For a first multi-GPU test, add `--max_episodes 10`
Switch to `--batch_mode fixed_per_gpu` only when you intentionally prioritize throughput.
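To make the trade-off concrete, here is an illustration of the assumed semantics: with `fixed_global` the global batch stays at `--batch_size` regardless of GPU count, while with `fixed_per_gpu` each GPU gets `--batch_size`, so the global batch grows with the number of GPUs. The helper is illustrative only, not the trainer's code:

```shell
# global_batch MODE BATCH_SIZE NUM_GPUS — effective global batch under each
# mode (illustrative model of the assumed semantics).
global_batch() {
  case "$1" in
    fixed_global)  echo "$2" ;;
    fixed_per_gpu) echo $(( $2 * $3 )) ;;
  esac
}

global_batch fixed_global 64 4    # -> 64  (matches a single-GPU reference run)
global_batch fixed_per_gpu 64 4   # -> 256 (throughput-oriented global batch)
```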
## FAQ
### 1. Container says dataset not found
Check:

- The host path is mounted to `/data/input`
- `/data/input/meta/info.json` exists
A missing `info.json` usually means a wrong path or directory layout, not a bug in the trainer.
### 2. DataLoader errors or NCCL timeouts on multi-GPU
Try:
- Keep `--num_workers 0`
- Lower `--convert_workers` to 2 or 4
- Shorten the pipeline with `--max_episodes` for a dry run
### 3. Choosing `task_name`
If task metadata is complete, `--task_name auto` is usually enough. For complex task definitions, set an explicit name to keep outputs and logs organized.
### 4. First run feels slow
ResNet18 weights are baked into the image, but conversion and the first data pass still take time. Steady log progress is expected.
## Practical tips
- Validate the pipeline with 1 epoch or a small `--max_episodes` before a long run
- Keep a fixed baseline config and change one knob at a time
- Compare several checkpoints, not only the last one