Spirit-v1.5 fine-tuning
Spirit-v1.5 is Spirit AI’s open VLA model. Upstream code lives at Spirit-AI-Team/spirit-v1.5; the model card is Spirit-AI-robotics/Spirit-v1.5.
IO-AI publishes ioaitech/train_spirit:1.5, a Docker image built on that upstream tree. It includes the Spirit-v1.5 training environment, Hugging Face base weights, a fine-tuning entrypoint, and LeRobot → RoboChallenge dataset layout conversion. When your data satisfies the constraints below, you can mount either a native RoboChallenge tree or a compatible LeRobot export and start fine-tuning.
The model card describes a Qwen/Qwen3-VL-4B-Instruct backbone with a DiT (Diffusion Transformer) action head. Upstream ships train.py and scripts/run_finetune.sh; the README marks the fine-tuning code as current as of 2026-04.
Default image on Docker Hub: ioaitech/train_spirit:1.5.
Scope
Spirit-v1.5 is a frontier model, not your first training baseline. Validate data quality with ACT or SmolVLA before committing here.
It is a reasonable choice when:
- You want a large VLA backbone and have multi-GPU capacity.
- Your data matches RoboChallenge-style state and camera layout.
- Task text and continuous actions are clean and consistent.
- You accept strict schema requirements and higher training cost.
Upstream recommends multi-GPU training and cites NVIDIA A100 80GB in its test notes. Consumer single-GPU runs are useful for reading the code and for tiny smoke tests; they are not a promise that full fine-tuning will fit.
Data requirements
Upstream training data follows the RoboChallenge layout:
dataset/
├── meta/
│ └── task_info.json
└── data/
└── episode_000000/
├── states/
│ └── states.jsonl
└── videos/
├── handeye_realsense_rgb.mp4
├── main_realsense_rgb.mp4
└── side_realsense_rgb.mp4
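Before mounting a native RoboChallenge tree, it can help to verify the layout matches the tree above. A minimal sketch in Python (the helper name and exact checks are this example's own, not something the image ships):

```python
from pathlib import Path

def check_robochallenge_layout(root: str) -> list[str]:
    """Return a list of problems found in a RoboChallenge-style dataset tree."""
    base = Path(root)
    problems = []
    if not (base / "meta" / "task_info.json").is_file():
        problems.append("missing meta/task_info.json")
    data_dir = base / "data"
    episodes = sorted(data_dir.glob("episode_*")) if data_dir.is_dir() else []
    if not episodes:
        problems.append("no data/episode_* directories")
    for ep in episodes:
        if not (ep / "states" / "states.jsonl").is_file():
            problems.append(f"{ep.name}: missing states/states.jsonl")
        for cam in ("handeye_realsense_rgb", "main_realsense_rgb", "side_realsense_rgb"):
            if not (ep / "videos" / f"{cam}.mp4").is_file():
                problems.append(f"{ep.name}: missing videos/{cam}.mp4")
    return problems

# An empty list means the tree matches the layout above:
# problems = check_robochallenge_layout("/path/to/robochallenge_dataset")
```

This only checks file presence, not file contents or video codecs.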
ioaitech/train_spirit:1.5 accepts:
- Native RoboChallenge datasets (train directly).
- Compatible LeRobot datasets (converted to RoboChallenge inside the container).
LeRobot conversion is strict. You need:
- observation.state with ≥8 dims: first 7 = end-effector pose, 8th = gripper width.
- Three RGB video streams mapped to main_realsense_rgb, handeye_realsense_rgb, side_realsense_rgb.
- meta/tasks.jsonl or CLI task text to build task_info.json.
If your LeRobot state is joint angles instead of an end-effector pose, do not train Spirit-v1.5 until you transform the state space.
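A quick pre-flight split makes the ≥8-dim requirement concrete. A sketch, assuming states are plain lists of floats (the function name is illustrative, not part of the image):

```python
def split_spirit_state(state: list[float]) -> tuple[list[float], float]:
    """Split a Spirit-style state vector into (end-effector pose, gripper width).

    Spirit-v1.5 expects >= 8 dims: the first 7 are the end-effector pose
    and the 8th is the gripper width. Extra dims beyond 8 are ignored here.
    """
    if len(state) < 8:
        raise ValueError(f"observation.state has {len(state)} dims, need >= 8")
    return state[:7], state[7]

# Example: position + quaternion + gripper width.
pose, gripper = split_spirit_state([0.3, 0.0, 0.2, 1.0, 0.0, 0.0, 0.0, 0.04])
```

Note the check only catches the wrong dimension count; a 7-D joint vector padded to 8 values would still pass it while remaining semantically wrong.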
Training with the image
Smoke test
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 1 \
--max_train_steps 1 \
--num_gpus 1 \
--wandb_mode disabled
This only checks weights, conversion, and the entrypoint, not model quality.
Custom camera mapping
If LeRobot image keys do not match built-in aliases, pass --camera_map:
docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--camera_map '{"main_realsense_rgb":"observation.images.front","handeye_realsense_rgb":"observation.images.left_wrist","side_realsense_rgb":"observation.images.right_wrist"}' \
--task_name pick_block \
--task_prompt "pick up the block and place it into the box" \
--batch_size 1 \
--max_train_steps 1000 \
--num_gpus 1 \
--wandb_mode disabled
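Hand-writing the --camera_map JSON inside shell quotes is error-prone; generating it with json.dumps guarantees valid JSON. A sketch (the slot names follow the example above; the LeRobot keys on the right are placeholders for your own):

```python
import json

# Spirit camera slots -> LeRobot image keys (right-hand side is dataset-specific).
camera_map = {
    "main_realsense_rgb": "observation.images.front",
    "handeye_realsense_rgb": "observation.images.left_wrist",
    "side_realsense_rgb": "observation.images.right_wrist",
}

# Paste the printed string after --camera_map, wrapped in single quotes.
print(json.dumps(camera_map))
```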
Multi-GPU template
docker run --rm --gpus all \
-v /path/to/robochallenge_or_lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 2 \
--max_train_steps 40000 \
--save_steps 2500 \
--log_interval 25 \
--num_workers 2 \
--prefetch_factor 2 \
--num_gpus 8 \
--wandb_mode disabled
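When sizing this template, keep the effective global batch in mind: per-GPU batch × world size (times gradient accumulation, if the upstream scripts use it; that is not shown here). A quick sanity check, assuming no accumulation:

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int, grad_accum: int = 1) -> int:
    """Effective samples per optimizer step under data-parallel training."""
    return per_gpu_batch * num_gpus * grad_accum

# The multi-GPU template: --batch_size 2 on 8 GPUs.
print(global_batch_size(2, 8))  # 16
```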
Default output:
/path/to/output/spirit_train/latest/
The image bakes Spirit-AI-robotics/Spirit-v1.5 under /models/spirit-v1.5-base. Override with --pretrained_path pointing at a directory containing model.safetensors and config.json.
Common flags
| Flag | Default | Meaning |
|---|---|---|
| --data_root | /data/input | Input dataset root. |
| --pretrained_path | /models/spirit-v1.5-base | Base weights directory. |
| --output_dir | /data/output/spirit_train/latest | Output directory. |
| --num_gpus | auto-detect | torchrun world size. |
| --batch_size | upstream default 32 (examples start smaller) | Per-GPU batch size. |
| --max_train_steps | 40000 | Max training steps. |
| --save_steps | 2500 | Checkpoint interval. |
| --log_interval | 25 | Logging interval. |
| --num_workers | 2 | DataLoader workers. |
| --prefetch_factor | 2 | Prefetch factor. |
| --wandb_mode | disabled | disabled, offline, or online. |
| --task_name | move_objects_into_box | Task name for LeRobot conversion. |
| --task_prompt | dataset task or name | Language prompt for conversion. |
| --camera_map | auto | JSON map from Spirit camera slots to LeRobot keys. |
Reproduce from upstream Spirit-v1.5
git clone https://github.com/Spirit-AI-Team/spirit-v1.5.git
cd spirit-v1.5
python -m venv .venv
source .venv/bin/activate
pip install -r requirements-base.txt
pip install -r requirements-train.txt
Fine-tuning uses scripts/run_finetune.sh via environment variables:
export DATA_ROOT=/path/to/robochallenge_dataset
export PRETRAINED_PATH=/path/to/Spirit-v1.5
export OUTPUT_DIR=./outputs/my_finetuned_model
export NUM_GPUS=8
export BATCH_SIZE=32
export MAX_TRAIN_STEPS=40000
export WANDB_MODE=disabled
./scripts/run_finetune.sh
Inspect train.py flags:
python train.py --help
Upstream does not auto-convert arbitrary LeRobot dumps to RoboChallenge. ioaitech/train_spirit:1.5 exposes that conversion path, but the end-effector + three-camera constraints still apply.
Troubleshooting
Missing third camera during conversion
Check video features in meta/info.json. If names lack front / wrist / side hints, set --camera_map explicitly.
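To see which keys are available, you can list the video features straight from meta/info.json. A sketch, assuming the common LeRobot layout where info.json has a top-level "features" dict with per-feature "dtype" fields (verify against your dataset version):

```python
import json
from pathlib import Path

def video_keys(dataset_root: str) -> list[str]:
    """List feature keys whose dtype is 'video' in a LeRobot meta/info.json."""
    info = json.loads((Path(dataset_root) / "meta" / "info.json").read_text())
    return [key for key, feat in info.get("features", {}).items()
            if feat.get("dtype") == "video"]

# e.g. ['observation.images.front', 'observation.images.left_wrist', ...]
```

If fewer than three keys come back, the third camera is genuinely missing and no --camera_map will fix it.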
observation.state too small
Spirit expects ≥8 values: 7-D pose + gripper width. Raw joint vectors are not interchangeable with pose.
Bad --pretrained_path
The directory must contain model.safetensors and config.json. The image ships defaults; source-only runs must download HF weights manually.
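A one-line check before launch can save a failed container start. A sketch of the check, using the two filenames the requirement above names:

```python
from pathlib import Path

def has_spirit_weights(path: str) -> bool:
    """True if the directory holds the files --pretrained_path requires."""
    p = Path(path)
    return (p / "model.safetensors").is_file() and (p / "config.json").is_file()
```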
OOM
Lower --batch_size and --num_workers, run a short --max_train_steps smoke test, then scale up. Production training expects multiple large GPUs.