LeRobot v2 vs v3 format differences
This page describes LeRobotDataset format versions, not lerobot software releases. The two must be kept distinct:
| Category | Example versions | Meaning |
|---|---|---|
| LeRobot software version | v0.4.0, v0.5.0 | Release version of the Hugging Face lerobot codebase |
| LeRobotDataset format version | v2.1, v3.0 | Version of dataset layout, metadata organization, and loading semantics |
As of the official lerobot v0.5.0 release, the primary dataset format remains LeRobotDataset v3.0. There is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing capabilities, not on a format generation bump. (See LeRobot v0.4.0 · LeRobotDataset v3.0.)
For local format conversion or visual validation, use LeRobot Studio: open a dataset and choose the target version (v2.1 or v3.0) on export.
Choosing a format version
- v2.1: Use when your training stack depends on the legacy layout, or when you need one file group per episode for manual inspection. For example, OpenPI training for Pi0 / Pi0.5 currently supports v2.1 only.
- v3.0: Use for large-scale datasets, when you need fewer files, or when streaming from the Hugging Face Hub.
Summary of differences
| Aspect | v2.0 / v2.1 | v3.0 |
|---|---|---|
| Storage | One file group per episode | Multiple episodes in fewer shared files; locations defined in metadata |
| Tabular data | data/chunk-XXX/episode_YYYYYY.parquet | data/chunk-XXX/file-YYY.parquet |
| Video | videos/chunk-XXX/{key}/episode_YYYYYY.mp4 | videos/{key}/chunk-XXX/file-YYY.mp4 |
| Episode metadata | meta/episodes.jsonl | Chunked Parquet under meta/episodes/ |
| Task metadata | meta/tasks.jsonl | Official docs center on meta/tasks.jsonl; some toolchains also accept or emit meta/tasks.parquet |
| Path resolution | Derived from episode_index and chunks_size | Path templates in meta/info.json plus per-episode location fields in metadata |
| Large-scale use | File count grows quickly | Better fit for large datasets and Hub streaming |
v3 design: write shared files at ingest time; reconstruct per-episode views at read time using metadata.
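The write-shared, read-per-episode idea can be illustrated with a minimal sketch that uses plain Python lists in place of Parquet shards. The names `pack_episodes`, `read_episode`, `from_index`, and `to_index` are hypothetical and chosen for illustration only; they are not the lerobot API or its actual metadata column names.

```python
from typing import Dict, List, Tuple


def pack_episodes(episodes: List[List[dict]]) -> Tuple[List[dict], List[Dict[str, int]]]:
    """Concatenate per-episode rows into one shared table and record each span."""
    shared_rows: List[dict] = []
    index: List[Dict[str, int]] = []
    for episode_index, rows in enumerate(episodes):
        start = len(shared_rows)
        shared_rows.extend(rows)
        # Half-open span [from_index, to_index) inside the shared table.
        index.append(
            {"episode_index": episode_index, "from_index": start, "to_index": len(shared_rows)}
        )
    return shared_rows, index


def read_episode(shared_rows: List[dict], index: List[Dict[str, int]], episode_index: int) -> List[dict]:
    """Rebuild a per-episode view by slicing the shared table with the recorded span."""
    span = index[episode_index]
    return shared_rows[span["from_index"]:span["to_index"]]


# Two toy episodes with three and two frames.
episodes = [
    [{"frame": 0}, {"frame": 1}, {"frame": 2}],
    [{"frame": 0}, {"frame": 1}],
]
shared_rows, index = pack_episodes(episodes)
second_episode = read_episode(shared_rows, index, 1)
```

The shared table holds all five rows once, and the small index is the only thing needed to recover either episode. The same trade-off applies to v3: fewer files on disk, at the cost of making the metadata mandatory for any per-episode read.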
Directory layout
v2.1
v2.1 uses one file group per episode; the tree is straightforward to inspect.
    meta/
      info.json
      episodes.jsonl
      tasks.jsonl
    data/
      chunk-000/
        episode_000000.parquet
        episode_000001.parquet
        ...
    videos/
      chunk-000/
        observation.images.front/
          episode_000000.mp4
          episode_000001.mp4
          ...
This layout is easy to audit manually, but the file count grows with every episode (one Parquet file plus one MP4 per camera), so very large datasets put increasing pressure on the filesystem.
v3.0
v3.0 merges many episodes into shared Parquet/MP4 shards and records each episode’s span inside those files in metadata.
    meta/
      info.json
      stats.json
      tasks.jsonl
      episodes/
        chunk-000/
          file-000.parquet
      # Some toolchains may also emit tasks.parquet
    data/
      chunk-000/
        file-000.parquet
    videos/
      observation.images.front/
        chunk-000/
          file-000.mp4
Per public documentation, meta/tasks.jsonl remains the primary task file for v3; validators or exporters may also support meta/tasks.parquet. Treat tasks.jsonl as the canonical reference and tasks.parquet as optional compatibility, not the sole required file for v3.
Metadata differences
Episode metadata
In v2, meta/episodes.jsonl holds one JSON object per line, one line per episode, typically including:
- episode_index
- length
- tasks / task_index
Paths are usually not stored per episode; they are inferred from episode_index and chunks_size.
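A minimal sketch of that inference, assuming the templates mirror the v2.1 paths shown in the summary table. In a real dataset the templates and chunk size should be read from meta/info.json rather than hard-coded. The function name `v2_episode_paths`, the template constants, and the default `chunks_size` of 1000 are assumptions made for illustration.

```python
from typing import Dict, Iterable

# Assumed templates, modeled on the v2.1 layout described in this page.
DATA_TEMPLATE = "data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
VIDEO_TEMPLATE = "videos/chunk-{chunk:03d}/{key}/episode_{episode_index:06d}.mp4"


def v2_episode_paths(episode_index: int, video_keys: Iterable[str], chunks_size: int = 1000) -> Dict:
    """Derive the per-episode file paths from the episode index alone."""
    if episode_index < 0 or chunks_size <= 0:
        raise ValueError("episode_index must be >= 0 and chunks_size must be > 0")
    # Episodes are grouped into chunk directories of chunks_size episodes each.
    chunk = episode_index // chunks_size
    return {
        "data": DATA_TEMPLATE.format(chunk=chunk, episode_index=episode_index),
        "videos": {
            key: VIDEO_TEMPLATE.format(chunk=chunk, key=key, episode_index=episode_index)
            for key in video_keys
        },
    }


paths = v2_episode_paths(1234, ["observation.images.front"])
```

Because the mapping is pure arithmetic, no per-episode location needs to be stored, which is why v2 metadata rows can stay small.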
In v3, episode rows live in chunked Parquet under meta/episodes/. Beyond length and task fields, they record:
- Global row span of the episode inside shared parquet
- Data shard location, e.g. data/chunk_index, data/file_index
- Video shard location and time span, e.g. from_timestamp, to_timestamp
That is how v3 keeps episode-level access while using shared files.
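As a rough sketch of how a reader might use such a row, the example below resolves one episode to its shard files and spans. The row is a plain dict standing in for a record loaded from meta/episodes/. The keys `data/chunk_index`, `data/file_index`, `from_timestamp`, and `to_timestamp` come from the list above; `row_start`, `row_stop`, `video/chunk_index`, and `video/file_index` are hypothetical names, and real datasets may name or namespace these columns differently (for example per camera key). The path templates are assumed from the summary table.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed templates, modeled on the v3.0 paths in the summary table.
# A real reader should take them from meta/info.json instead.
DATA_TEMPLATE = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_TEMPLATE = "videos/{key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


@dataclass(frozen=True)
class EpisodeLocation:
    data_path: str
    row_span: Tuple[int, int]       # half-open [start, stop) rows in the shared Parquet
    video_path: str
    time_span: Tuple[float, float]  # seconds inside the shared MP4


def locate_episode(row: dict, video_key: str) -> EpisodeLocation:
    """Resolve one episode's shard files and spans from its metadata row."""
    return EpisodeLocation(
        data_path=DATA_TEMPLATE.format(
            chunk_index=row["data/chunk_index"], file_index=row["data/file_index"]
        ),
        # Hypothetical column names for the global row span.
        row_span=(row["row_start"], row["row_stop"]),
        video_path=VIDEO_TEMPLATE.format(
            key=video_key,
            chunk_index=row["video/chunk_index"],  # hypothetical column name
            file_index=row["video/file_index"],    # hypothetical column name
        ),
        time_span=(row["from_timestamp"], row["to_timestamp"]),
    )


# Illustrative row; the values are made up.
row = {
    "episode_index": 7,
    "data/chunk_index": 0,
    "data/file_index": 2,
    "row_start": 3500,
    "row_stop": 4100,
    "video/chunk_index": 0,
    "video/file_index": 1,
    "from_timestamp": 116.7,
    "to_timestamp": 136.7,
}
location = locate_episode(row, "observation.images.front")
```

Unlike the v2 case, nothing here can be derived from `episode_index` alone: the data shard and the video shard are looked up independently, and an episode's rows and frames are a slice of each.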
meta/info.json
For both v2 and v3, meta/info.json is the first file to validate because it defines:
- codebase_version
- fps
- features
- splits
- Path templates
The key field is codebase_version:
- v2.0 / v2.1 → legacy dataset layout
- v3.0 → new dataset layout
Compared with v2, v3 relies more heavily on path templates plus metadata-backed addressing, so the constraints on info.json are stricter. The official v3 docs treat data_path and video_path as part of the format definition. (See Official v3 documentation.)
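A small sketch of the first validation step: read meta/info.json and map codebase_version to the layout family described above. The function names and the layout labels ("per-episode", "shared-shards") are illustrative choices, not lerobot identifiers.

```python
import json
from pathlib import Path

# Mapping taken from the version list above; the labels are illustrative.
LAYOUT_BY_VERSION = {
    "v2.0": "per-episode",
    "v2.1": "per-episode",
    "v3.0": "shared-shards",
}


def layout_for_version(codebase_version) -> str:
    """Map a codebase_version string to its layout family."""
    try:
        return LAYOUT_BY_VERSION[codebase_version]
    except KeyError:
        raise ValueError(f"Unrecognized codebase_version: {codebase_version!r}") from None


def read_layout(dataset_root) -> str:
    """Read meta/info.json under dataset_root and return the layout family."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    with info_path.open(encoding="utf-8") as f:
        info = json.load(f)
    # A missing key yields None, which is rejected as unrecognized.
    return layout_for_version(info.get("codebase_version"))
```

Failing loudly on an unknown version is deliberate: guessing the layout from the directory tree can mask a half-converted dataset.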
Does v0.5.0 change the dataset format?
The answer has two parts: format and capabilities.
Format
- LeRobotDataset remains v3.0
- No new v3.1
- No new breaking layout generation
Capabilities
Notable dataset-related items in lerobot v0.5.0 include:
- Streaming video encoding (encode while recording; less idle time between episodes)
- Faster training and encoding
- More dataset tools
- Subtask support
- Image-to-video conversion
These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. (See LeRobot v0.5.0 · Streaming video encoding.)
Is streaming_encoding a format change?
No. It is a recording-time performance option that shifts encoding from “batch after each episode” to “incremental during capture”. It affects encoding timing, CPU/GPU load, and recording latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. (See Streaming video encoding.)
Format conversion
Official v2.1 → v3.0
Hugging Face ships a migration utility that turns per-episode parquet/mp4 into v3.0 shared shards and writes the required episode indexing metadata.
- Docs: Migrate v2.1 → v3.0
- Example:
    python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset
Best suited to datasets already on the Hugging Face Hub.
Bidirectional conversion with LeRobot Studio (v2.1 ↔ v3.0)
LeRobot Studio fits local or private data:
- Open .tar.gz or an extracted folder
- Review playback and health checks
- Export with a chosen target version
Typical flows:
- v2.1 → v3.0: merge per-episode parquet/mp4 into shared shards
- v3.0 → v2.1: split shared shards back to per-episode files
For combined visual inspection and format conversion, LeRobot Studio is usually the most direct option.
Validation checklist
Recommended order:
- Open meta/info.json and confirm codebase_version
- Verify the directory tree matches that version
- Confirm tasks, episodes, and media files are present and consistent
Minimum for v2.1
- meta/info.json
- meta/episodes.jsonl
- meta/tasks.jsonl
- data/chunk-*/episode_*.parquet
- videos/chunk-*/.../episode_*.mp4
Minimum for v3.0
- meta/info.json
- At least one Parquet file under meta/episodes/
- data/chunk-*/file-*.parquet
- videos/.../chunk-*/file-*.mp4
- meta/tasks.jsonl
If a tool also writes meta/tasks.parquet, treat it as supplementary; do not infer a different format version from it alone.
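The two minimum lists can be turned into a quick presence check. This is a sketch rather than an official validator: it only tests that at least one path matches each pattern, and it assumes that the elided directory level in the video paths is a single directory named after the camera key. `REQUIRED_GLOBS` and `missing_entries` are hypothetical names.

```python
from pathlib import Path
from typing import List

# Patterns transcribed from the minimum lists above. The "*" standing in for
# the camera-key directory in the video patterns is an assumption.
REQUIRED_GLOBS = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "meta/tasks.jsonl",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
    ],
}


def missing_entries(dataset_root, version: str) -> List[str]:
    """Return the required patterns that match nothing under dataset_root."""
    root = Path(dataset_root)
    # any() stops at the first match, so large datasets are not fully listed.
    return [pattern for pattern in REQUIRED_GLOBS[version] if not any(root.glob(pattern))]
```

An empty result means only that the expected files exist. Consistency between tasks, episodes, and media (the third step of the checklist) still has to be checked against the metadata contents.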
Training selection notes
Decide separately:
- Dataset format (v2.1 vs v3.0)
- Framework support for that format (model scripts, pinned lerobot release)
Common cases:
- Pi0 / OpenPI: often still requires v2.1
- Current LeRobot training and dataset tooling: oriented toward v3.0
Therefore, choose v2.1 or v3.0 based on data compatibility and training stack support, not on the lerobot release headline alone.