Spirit-v1.5 fine-tuning

Spirit-v1.5 is Spirit AI’s open VLA model. Upstream code lives at Spirit-AI-Team/spirit-v1.5; the model card is Spirit-AI-robotics/Spirit-v1.5.

IO-AI publishes ioaitech/train_spirit:1.5, a Docker image built on that upstream tree. It includes the Spirit-v1.5 training environment, Hugging Face base weights, a fine-tuning entrypoint, and LeRobot → RoboChallenge dataset layout conversion. When your data satisfies the constraints below, you can mount either a native RoboChallenge tree or a compatible LeRobot export and start fine-tuning.

The model card describes a Qwen/Qwen3-VL-4B-Instruct backbone with a DiT (Diffusion Transformer) action head. Upstream ships train.py and scripts/run_finetune.sh; the README notes that the fine-tuning code is current as of 2026-04.

Default image on Docker Hub: ioaitech/train_spirit:1.5.

Scope

Spirit-v1.5 is a frontier model—not your first training baseline. Validate data quality with ACT or SmolVLA before committing here.

It is a reasonable choice when:

  • You want a large VLA backbone and have multi-GPU capacity.
  • Your data matches RoboChallenge-style state and camera layout.
  • Task text and continuous actions are clean and consistent.
  • You accept strict schema requirements and higher training cost.

Upstream recommends multi-GPU training and cites NVIDIA A100 80GB GPUs in its test notes. Consumer single-GPU runs are useful for reading the code and running tiny smoke tests, not for full fine-tuning.

Data requirements

Upstream training data follows the RoboChallenge layout:

dataset/
├── meta/
│   └── task_info.json
└── data/
    └── episode_000000/
        ├── states/
        │   └── states.jsonl
        └── videos/
            ├── handeye_realsense_rgb.mp4
            ├── main_realsense_rgb.mp4
            └── side_realsense_rgb.mp4

ioaitech/train_spirit:1.5 accepts:

  • Native RoboChallenge datasets (train directly).
  • Compatible LeRobot datasets (converted to RoboChallenge inside the container).
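Before mounting a native RoboChallenge tree, it can save a failed container run to verify the layout on the host first. A minimal pre-flight check, written against the directory layout shown above (the function name and return style are illustrative, not part of the image):

```python
from pathlib import Path

REQUIRED_VIDEOS = {
    "handeye_realsense_rgb.mp4",
    "main_realsense_rgb.mp4",
    "side_realsense_rgb.mp4",
}

def check_robochallenge_tree(root):
    """Return a list of problems found in a RoboChallenge-style dataset tree."""
    root = Path(root)
    problems = []
    if not (root / "meta" / "task_info.json").is_file():
        problems.append("missing meta/task_info.json")
    data_dir = root / "data"
    episodes = sorted(data_dir.glob("episode_*")) if data_dir.is_dir() else []
    if not episodes:
        problems.append("no data/episode_* directories")
    for ep in episodes:
        if not (ep / "states" / "states.jsonl").is_file():
            problems.append(f"{ep.name}: missing states/states.jsonl")
        videos_dir = ep / "videos"
        videos = {p.name for p in videos_dir.glob("*.mp4")} if videos_dir.is_dir() else set()
        missing = REQUIRED_VIDEOS - videos
        if missing:
            problems.append(f"{ep.name}: missing videos {sorted(missing)}")
    return problems
```

An empty list means the tree at least matches the expected file layout; it says nothing about the contents of states.jsonl or the videos.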

LeRobot conversion is strict. You need:

  • observation.state with ≥8 dims: first 7 = end-effector pose, 8th = gripper width.
  • Three RGB video streams mapped to main_realsense_rgb, handeye_realsense_rgb, side_realsense_rgb.
  • meta/tasks.jsonl or CLI task text to build task_info.json.

If your LeRobot state is joint angles instead of an end-effector pose, do not train Spirit-v1.5 until you transform the state space.
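The state and camera constraints above can be checked on the host before starting a container run. A sketch, assuming a LeRobot-style meta/info.json whose "features" entries carry a "shape" for the state and "dtype": "video" for camera streams (schemas vary by LeRobot version, so adjust the key names for yours):

```python
import json
from pathlib import Path

def check_lerobot_export(root):
    """Sanity-check a LeRobot export against Spirit-v1.5's conversion constraints."""
    info = json.loads((Path(root) / "meta" / "info.json").read_text())
    feats = info["features"]
    problems = []
    state = feats.get("observation.state")
    if state is None or state["shape"][0] < 8:
        problems.append("observation.state needs >= 8 dims (7-D EE pose + gripper width)")
    videos = [k for k, v in feats.items() if v.get("dtype") == "video"]
    if len(videos) < 3:
        problems.append(f"need 3 RGB video streams, found {len(videos)}: {videos}")
    return problems
```

Note this only checks dimensionality: an 8-D joint-angle vector passes the size check but is still the wrong state space, which no script can detect for you.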

Training with the image

Smoke test

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 1 \
--max_train_steps 1 \
--num_gpus 1 \
--wandb_mode disabled

This only checks weights, conversion, and the entrypoint—not model quality.

Custom camera mapping

If LeRobot image keys do not match built-in aliases, pass --camera_map:

docker run --rm --gpus all \
-v /path/to/lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--camera_map '{"main_realsense_rgb":"observation.images.front","handeye_realsense_rgb":"observation.images.left_wrist","side_realsense_rgb":"observation.images.right_wrist"}' \
--task_name pick_block \
--task_prompt "pick up the block and place it into the box" \
--batch_size 1 \
--max_train_steps 1000 \
--num_gpus 1 \
--wandb_mode disabled
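Hand-writing the --camera_map JSON inside shell quotes is error-prone. A sketch that builds the argument programmatically so the JSON is valid and all three Spirit slots are covered (the observation.images.* key names are the ones from the example above; substitute your dataset's actual keys):

```python
import json

# Example LeRobot keys; replace with the keys from your meta/info.json.
camera_map = {
    "main_realsense_rgb": "observation.images.front",
    "handeye_realsense_rgb": "observation.images.left_wrist",
    "side_realsense_rgb": "observation.images.right_wrist",
}

# All three Spirit slots must be present, each mapped to a distinct key.
assert set(camera_map) == {"main_realsense_rgb", "handeye_realsense_rgb", "side_realsense_rgb"}
assert len(set(camera_map.values())) == 3

# Single-quote the value on the command line so the JSON's double quotes survive the shell.
arg = "--camera_map '" + json.dumps(camera_map, separators=(",", ":")) + "'"
```

Printing `arg` and pasting it into the docker command avoids the most common failure mode: a stray quote that makes the container see truncated JSON.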

Multi-GPU template

docker run --rm --gpus all \
-v /path/to/robochallenge_or_lerobot_dataset:/data/input:rw \
-v /path/to/output:/data/output:rw \
ioaitech/train_spirit:1.5 \
--batch_size 2 \
--max_train_steps 40000 \
--save_steps 2500 \
--log_interval 25 \
--num_workers 2 \
--prefetch_factor 2 \
--num_gpus 8 \
--wandb_mode disabled

Default output:

/path/to/output/spirit_train/latest/

The image bakes Spirit-AI-robotics/Spirit-v1.5 under /models/spirit-v1.5-base. Override with --pretrained_path pointing at a directory containing model.safetensors and config.json.
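A custom --pretrained_path can be validated with a one-liner before launching the container; a sketch (the helper name is illustrative):

```python
from pathlib import Path

def usable_pretrained_dir(path):
    """True if the directory contains the two files --pretrained_path requires."""
    p = Path(path)
    return (p / "model.safetensors").is_file() and (p / "config.json").is_file()
```

If this returns False for your override directory, the run will fail at weight-loading time anyway, so checking up front costs nothing.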

Common flags

Flag                Default                                        Meaning
--data_root         /data/input                                    Input dataset root.
--pretrained_path   /models/spirit-v1.5-base                       Base weights directory.
--output_dir        /data/output/spirit_train/latest               Output directory.
--num_gpus          auto-detect                                    torchrun world size.
--batch_size        upstream default 32 (examples start smaller)   Per-GPU batch size.
--max_train_steps   40000                                          Max training steps.
--save_steps        2500                                           Checkpoint interval.
--log_interval      25                                             Logging interval.
--num_workers       2                                              DataLoader workers.
--prefetch_factor   2                                              Prefetch factor.
--wandb_mode        disabled                                       disabled, offline, or online.
--task_name         move_objects_into_box                          Task name for LeRobot conversion.
--task_prompt       dataset task or name                           Language prompt for conversion.
--camera_map        auto                                           JSON map from Spirit camera slots to LeRobot keys.
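When sweeping over these flags it helps to assemble the docker invocation programmatically rather than editing shell scripts; a sketch (the mount points and image tag come from the examples above, the helper itself is not part of the image):

```python
def docker_train_cmd(input_dir, output_dir, **flags):
    """Build a `docker run` argv for ioaitech/train_spirit:1.5 from keyword flags."""
    cmd = [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{input_dir}:/data/input:rw",
        "-v", f"{output_dir}:/data/output:rw",
        "ioaitech/train_spirit:1.5",
    ]
    for name, value in flags.items():
        cmd += [f"--{name}", str(value)]
    return cmd
```

Passing the resulting list to `subprocess.run(cmd)` sidesteps shell quoting entirely, including for a JSON-valued --camera_map.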

Reproduce from upstream Spirit-v1.5

git clone https://github.com/Spirit-AI-Team/spirit-v1.5.git
cd spirit-v1.5

python -m venv .venv
source .venv/bin/activate
pip install -r requirements-base.txt
pip install -r requirements-train.txt

Fine-tuning uses scripts/run_finetune.sh via environment variables:

export DATA_ROOT=/path/to/robochallenge_dataset
export PRETRAINED_PATH=/path/to/Spirit-v1.5
export OUTPUT_DIR=./outputs/my_finetuned_model
export NUM_GPUS=8
export BATCH_SIZE=32
export MAX_TRAIN_STEPS=40000
export WANDB_MODE=disabled

./scripts/run_finetune.sh

Inspect train.py flags:

python train.py --help

Upstream does not auto-convert arbitrary LeRobot dumps to RoboChallenge. ioaitech/train_spirit:1.5 exposes that conversion path, but the end-effector + three-camera constraints still apply.

Troubleshooting

Missing third camera during conversion

Check video features in meta/info.json. If names lack front / wrist / side hints, set --camera_map explicitly.
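The image's actual alias logic is not documented here, but a best-effort matcher along the lines of those name hints can show you which slot would stay empty; a sketch (hint words are assumptions, so treat a partial result as "pass --camera_map explicitly"):

```python
# Spirit camera slots and plausible substrings of matching LeRobot keys.
HINTS = {
    "main_realsense_rgb": ("front", "main", "top"),
    "handeye_realsense_rgb": ("wrist", "hand", "eye"),
    "side_realsense_rgb": ("side", "left", "right"),
}

def guess_camera_map(video_keys):
    """Greedy guess of slot -> key; slots missing from the result need --camera_map."""
    mapping, used = {}, set()
    for slot, hints in HINTS.items():
        for key in video_keys:
            if key not in used and any(h in key.lower() for h in hints):
                mapping[slot] = key
                used.add(key)
                break
    return mapping
```

If `guess_camera_map` leaves any of the three slots unmapped for your key names, the container's auto-detection is likely to fail the same way.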

observation.state too small

Spirit expects ≥8 values: 7-D pose + gripper width. Raw joint vectors are not interchangeable with pose.

Bad --pretrained_path

The directory must contain model.safetensors and config.json. The image ships defaults; source-only runs must download HF weights manually.

OOM

Lower --batch_size and --num_workers, run a short --max_train_steps smoke test, then scale up. Production training expects multiple large GPUs.
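When lowering --batch_size, keep the global batch in mind: with the upstream defaults of 32 per GPU on 8 GPUs, the effective batch is 256, and a single-GPU run at --batch_size 1 trains on a 256x smaller batch. The arithmetic, with gradient accumulation shown as a hypothetical knob (it is not a documented flag of this image):

```python
def effective_batch(batch_size, num_gpus, grad_accum=1):
    """Effective global batch = per-GPU batch x world size x accumulation steps."""
    return batch_size * num_gpus * grad_accum
```

If your trainer supports accumulation, raising `grad_accum` is the usual way to recover the global batch after an OOM-driven cut to the per-GPU batch.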

References