Spirit-v1.5 fine-tuning
Spirit-v1.5 is Spirit AI’s open VLA model. Upstream code lives at Spirit-AI-Team/spirit-v1.5; the model card is Spirit-AI-robotics/Spirit-v1.5.
IO-AI publishes ioaitech/train_spirit:1.5, a Docker image built on that upstream tree. It includes the Spirit-v1.5 training environment, Hugging Face base weights, a fine-tuning entrypoint, and LeRobot → RoboChallenge dataset layout conversion. When your data satisfies the constraints below, you can mount either a native RoboChallenge tree or a compatible LeRobot export and start fine-tuning.
The model card describes a Qwen/Qwen3-VL-4B-Instruct backbone with a DiT (Diffusion Transformer) action head. Upstream ships train.py and scripts/run_finetune.sh; the README marks the fine-tuning code as current as of 2026-04.
Default image on Docker Hub: ioaitech/train_spirit:1.5.
Scope
Spirit-v1.5 is a frontier model, not your first training baseline. Validate data quality with ACT or SmolVLA before committing here.
It is a reasonable choice when:
- You want a large VLA backbone and have multi-GPU capacity.
- Your data matches RoboChallenge-style state and camera layout.
- Task text and continuous actions are clean and consistent.
- You accept strict schema requirements and higher training cost.
Upstream recommends multi-GPU training and cites NVIDIA A100 80GB in its test notes. Consumer single-GPU runs are useful for reading the code and for tiny smoke tests; they are not a promise that full fine-tuning will fit.
Data requirements
Upstream training data follows the RoboChallenge layout:
dataset/
├── meta/
│ └── task_info.json
└── data/
└── episode_000000/
├── states/
│ └── states.jsonl
└── videos/
├── handeye_realsense_rgb.mp4
├── main_realsense_rgb.mp4
└── side_realsense_rgb.mp4
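Before mounting a native RoboChallenge tree, it can help to verify the layout matches the tree above. A minimal sketch in Python (the helper name and exact checks are this example's own, not something the image ships):

```python
from pathlib import Path

def check_robochallenge_layout(root: str) -> list[str]:
    """Return a list of problems found in a RoboChallenge-style dataset tree."""
    base = Path(root)
    problems = []
    if not (base / "meta" / "task_info.json").is_file():
        problems.append("missing meta/task_info.json")
    data_dir = base / "data"
    episodes = sorted(data_dir.glob("episode_*")) if data_dir.is_dir() else []
    if not episodes:
        problems.append("no data/episode_* directories")
    for ep in episodes:
        if not (ep / "states" / "states.jsonl").is_file():
            problems.append(f"{ep.name}: missing states/states.jsonl")
        for cam in ("handeye_realsense_rgb", "main_realsense_rgb", "side_realsense_rgb"):
            if not (ep / "videos" / f"{cam}.mp4").is_file():
                problems.append(f"{ep.name}: missing videos/{cam}.mp4")
    return problems

# An empty list means the tree matches the layout above:
# problems = check_robochallenge_layout("/path/to/robochallenge_dataset")
```

This only checks file presence, not file contents or video codecs.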
ioaitech/train_spirit:1.5 accepts:
- Native RoboChallenge datasets (train directly).
- Compatible LeRobot datasets (converted to RoboChallenge inside the container).
LeRobot conversion is strict. You need:
- observation.state with ≥8 dims: first 7 = end-effector pose, 8th = gripper width.
- Three RGB video streams mapped to main_realsense_rgb, handeye_realsense_rgb, side_realsense_rgb.
- meta/tasks.jsonl or CLI task text to build task_info.json.
If your LeRobot state is joint angles instead of an end-effector pose, do not train Spirit-v1.5 until you transform the state space.
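A quick pre-flight split makes the ≥8-dim requirement concrete. A sketch, assuming states are plain lists of floats (the function name is illustrative, not part of the image):

```python
def split_spirit_state(state: list[float]) -> tuple[list[float], float]:
    """Split a Spirit-style state vector into (end-effector pose, gripper width).

    Spirit-v1.5 expects >= 8 dims: the first 7 are the end-effector pose
    and the 8th is the gripper width. Extra dims beyond 8 are ignored here.
    """
    if len(state) < 8:
        raise ValueError(f"observation.state has {len(state)} dims, need >= 8")
    return state[:7], state[7]

# Example: position + quaternion + gripper width.
pose, gripper = split_spirit_state([0.3, 0.0, 0.2, 1.0, 0.0, 0.0, 0.0, 0.04])
```

Note the check only catches the wrong dimension count; a 7-D joint vector padded to 8 values would still pass it while remaining semantically wrong.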
Training with the image
Smoke test
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 1 \
--max_train_steps 1 \
--num_gpus 1 \
--wandb_mode disabled
This only checks weights, conversion, and the entrypoint, not model quality.
Custom camera mapping
If LeRobot image keys do not match built-in aliases, pass --camera_map:
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--camera_map '{"main_realsense_rgb":"observation.images.front","handeye_realsense_rgb":"observation.images.left_wrist","side_realsense_rgb":"observation.images.right_wrist"}' \
--task_name pick_block \
--task_prompt "pick up the block and place it into the box" \
--batch_size 1 \
--max_train_steps 1000 \
--num_gpus 1 \
--wandb_mode disabled
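Hand-writing the --camera_map JSON inside shell quotes is error-prone; generating it with json.dumps guarantees valid JSON. A sketch (the slot names follow the example above; the LeRobot keys on the right are placeholders for your own):

```python
import json

# Spirit camera slots -> LeRobot image keys (right-hand side is dataset-specific).
camera_map = {
    "main_realsense_rgb": "observation.images.front",
    "handeye_realsense_rgb": "observation.images.left_wrist",
    "side_realsense_rgb": "observation.images.right_wrist",
}

# Paste the printed string after --camera_map, wrapped in single quotes.
print(json.dumps(camera_map))
```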
Multi-GPU template
docker run --rm --gpus all \
-v /path/to/robochallenge_or_lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 2 \
--max_train_steps 40000 \
--save_steps 2500 \
--log_interval 25 \
--num_workers 2 \
--prefetch_factor 2 \
--num_gpus 8 \
--wandb_mode disabled
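When sizing this template, keep the effective global batch in mind: per-GPU batch × world size (times gradient accumulation, if the upstream scripts use it; that is not shown here). A quick sanity check, assuming no accumulation:

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Effective samples per optimizer step under data-parallel training."""
    return per_gpu_batch * num_gpus * grad_accum

# The multi-GPU template: --batch_size 2 on 8 GPUs.
print(global_batch_size(2, 8))  # 16
```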
Default output:
/path/to/output/spirit_train/latest/
The image bakes Spirit-AI-robotics/Spirit-v1.5 under /models/spirit-v1.5-base. Override with --pretrained_path pointing at a directory containing model.safetensors and config.json.
Common flags
| Flag | Default | Meaning |
|---|---|---|
| --data_root | /data/input | Input dataset root. |
| --pretrained_path | /models/spirit-v1.5-base | Base weights directory. |
| --output_dir | /data/output/spirit_train/latest | Output directory. |
| --num_gpus | auto-detect | torchrun world size. |
| --batch_size | upstream default 32 (examples start smaller) | Per-GPU batch size. |
| --max_train_steps | 40000 | Max training steps. |
| --save_steps | 2500 | Checkpoint interval. |
| --log_interval | 25 | Logging interval. |
| --num_workers | 2 | DataLoader workers. |
| --prefetch_factor | 2 | Prefetch factor. |
| --wandb_mode | disabled | disabled, offline, or online. |
| --task_name | move_objects_into_box | Task name for LeRobot conversion. |
| --task_prompt | dataset task or name | Language prompt for conversion. |
| --camera_map | auto | JSON map from Spirit camera slots to LeRobot keys. |
Reproduce from upstream Spirit-v1.5
git clone https://github.com/Spirit-AI-Team/spirit-v1.5.git
cd spirit-v1.5
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-base.txt
pip install -r requirements-train.txt
Fine-tuning uses scripts/run_finetune.sh via environment variables:
export DATA_ROOT=/path/to/robochallenge_dataset
export PRETRAINED_PATH=/path/to/Spirit-v1.5
export OUTPUT_DIR=./outputs/my_finetuned_model
export NUM_GPUS=8
export BATCH_SIZE=32
export MAX_TRAIN_STEPS=40000
export WANDB_MODE=disabled
./scripts/run_finetune.sh
Inspect train.py flags:
python train.py --help
Upstream does not auto-convert arbitrary LeRobot dumps to RoboChallenge. ioaitech/train_spirit:1.5 exposes that conversion path, but the end-effector + three-camera constraints still apply.
Troubleshooting
Missing third camera during conversion
Check video features in meta/info.json. If names lack front / wrist / side hints, set --camera_map explicitly.
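To see which keys are available, you can list the video features straight from meta/info.json. A sketch, assuming the common LeRobot layout where info.json has a top-level "features" dict with per-feature "dtype" fields (verify against your dataset version):

```python
import json
from pathlib import Path

def video_keys(dataset_root: str) -> list[str]:
    """List feature keys whose dtype is 'video' in a LeRobot meta/info.json."""
    info = json.loads((Path(dataset_root) / "meta" / "info.json").read_text())
    return [key for key, feat in info.get("features", {}).items()
            if feat.get("dtype") == "video"]

# e.g. ['observation.images.front', 'observation.images.left_wrist', ...]
```

If fewer than three keys come back, the third camera is genuinely missing and no --camera_map will fix it.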
observation.state too small
Spirit expects ≥8 values: 7-D pose + gripper width. Raw joint vectors are not interchangeable with pose.
Bad --pretrained_path
The directory must contain model.safetensors and config.json. The image ships defaults; source-only runs must download HF weights manually.
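A one-line check before launch can save a failed container start. A sketch of the check, using the two filenames the requirement above names:

```python
from pathlib import Path

def has_spirit_weights(path: str) -> bool:
    """True if the directory holds the files --pretrained_path requires."""
    p = Path(path)
    return (p / "model.safetensors").is_file() and (p / "config.json").is_file()
```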
OOM
Lower --batch_size and --num_workers, run a short --max_train_steps smoke test, then scale up. Production training expects multiple large GPUs.