LeRobotDataset v2 vs v3
This page discusses LeRobotDataset format versions, not lerobot software releases. Keep the distinction explicit:
| Category | Example | Meaning |
|---|---|---|
| LeRobot software | v0.4.0, v0.5.0 | Hugging Face lerobot package releases |
| LeRobotDataset format | v2.1, v3.0 | Dataset directory layout, metadata organization, and loader semantics |
As of the official lerobot v0.5.0 announcement, the dataset format generation is still LeRobotDataset v3.0; there is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing. LeRobot v0.4.0 · LeRobotDataset v3.0
For format conversion or visual validation locally, use LeRobot Studio: open a dataset and pick the target version (v2.1 or v3.0) when exporting.
Choosing a format
- Pick v2.1 when your training stack still expects the legacy per-episode file layout, or when you want “one episode, one bundle of files” for manual inspection. For example, OpenPI Pi0/Pi0.5 training currently expects v2.1.
- Pick v3.0 for large-scale datasets, fewer files on disk, or Hub streaming workflows.
Comparison at a glance
| Dimension | v2.0 / v2.1 | v3.0 |
|---|---|---|
| Storage | One file group per episode | Many episodes packed into a few large files, located via metadata |
| Tabular | data/chunk-XXX/episode_YYYYYY.parquet | data/chunk-XXX/file-YYY.parquet |
| Video | videos/chunk-XXX/{key}/episode_YYYYYY.mp4 | videos/{key}/chunk-XXX/file-YYY.mp4 |
| Episode metadata | meta/episodes.jsonl | Sharded Parquet under meta/episodes/ |
| Task metadata | meta/tasks.jsonl | Official docs center on meta/tasks.jsonl; some tools may also emit meta/tasks.parquet |
| Path resolution | Derived from episode_index + chunks_size | Path templates in meta/info.json plus per-episode locators in metadata |
| Large-scale | File count grows quickly | Better suited to large corpora and Hub streaming |
The v3 idea: write shared files, then reconstruct episode-level views from metadata.
Directory layouts
v2.1
v2.1 keeps one episode per group of files—easy to browse manually.
meta/
  info.json
  episodes.jsonl
  tasks.jsonl
data/
  chunk-000/
    episode_000000.parquet
    episode_000001.parquet
    ...
videos/
  chunk-000/
    observation.images.front/
      episode_000000.mp4
      episode_000001.mp4
      ...
Convenient for inspection, but filesystem pressure grows with episode count.
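To make the file-count growth concrete, here is a rough back-of-the-envelope sketch. It assumes one Parquet file per episode and one MP4 per episode per camera key, as in the tree above; the function name and the example numbers are illustrative, not taken from any real dataset.

```python
def count_v21_files(num_episodes: int, num_cameras: int) -> int:
    """Estimate data + video files in a v2.1 tree (metadata files excluded)."""
    parquet_files = num_episodes              # one episode_*.parquet per episode
    video_files = num_episodes * num_cameras  # one episode_*.mp4 per episode per camera key
    return parquet_files + video_files


print(count_v21_files(num_episodes=1_000, num_cameras=2))   # 3000
print(count_v21_files(num_episodes=50_000, num_cameras=3))  # 200000
```

The count scales linearly with episodes and cameras, which is the pressure the shared-file layout in v3.0 is meant to relieve.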
v3.0
v3.0 packs multiple episodes into shared Parquet/MP4 shards and records where each episode lives inside those files.
meta/
  info.json
  stats.json
  tasks.jsonl
  episodes/
    chunk-000/
      file-000.parquet
  # some toolchains may also emit tasks.parquet
data/
  chunk-000/
    file-000.parquet
videos/
  observation.images.front/
    chunk-000/
      file-000.mp4
Per public docs, meta/tasks.jsonl remains the canonical task metadata for v3; tasks.parquet may appear as a compatibility or supplemental artifact. Treat tasks.jsonl as the primary spec and tasks.parquet as optional—not the sole definition of v3.
Metadata differences
Episode metadata
v2 stores one JSON line per episode in meta/episodes.jsonl, typically including:
- episode_index
- length
- tasks / task_index
Paths are usually not stored per row; loaders derive them from episode_index and chunks_size.
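The derivation can be sketched as follows. The two templates are assumptions written to mirror the v2.1 layout shown in the comparison table; in a real dataset, read them from meta/info.json instead of hard-coding them. The function name is hypothetical.

```python
# Assumed v2.1-style templates that mirror the layout in this page.
# In practice, read the actual templates from meta/info.json.
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
VIDEO_PATH = "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"


def v21_episode_paths(episode_index: int, chunks_size: int, video_keys: list[str]) -> dict[str, str]:
    """Derive per-episode file paths from episode_index and chunks_size."""
    # Episodes are grouped into chunk directories of chunks_size episodes each.
    episode_chunk = episode_index // chunks_size
    paths = {
        "data": DATA_PATH.format(episode_chunk=episode_chunk, episode_index=episode_index)
    }
    for key in video_keys:
        paths[key] = VIDEO_PATH.format(
            episode_chunk=episode_chunk, video_key=key, episode_index=episode_index
        )
    return paths


paths = v21_episode_paths(1234, chunks_size=1000, video_keys=["observation.images.front"])
print(paths["data"])                      # data/chunk-001/episode_001234.parquet
print(paths["observation.images.front"])  # videos/chunk-001/observation.images.front/episode_001234.mp4
```

Because every path is a pure function of episode_index, v2 needs no per-episode locator columns.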
v3 stores episode rows in sharded Parquet under meta/episodes/, including:
- Global row span inside the shared parquet shard
- Data file coordinates (data/chunk_index, data/file_index, …)
- Video file coordinates and time spans (from_timestamp, to_timestamp, …)
That is how v3 preserves episode-level access despite shared files.
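A minimal sketch of that lookup, using a hand-written episode row instead of a real Parquet read. The templates, the dataset_from_index / dataset_to_index column names, the videos/{key}/ column prefix, and all values are assumptions for illustration; check meta/info.json and the columns of your own meta/episodes/ shards before relying on them.

```python
# Assumed v3.0-style templates; the real ones live in meta/info.json.
DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"

# Hypothetical episode row, as it might appear in a meta/episodes/ shard.
episode_row = {
    "episode_index": 7,
    "data/chunk_index": 0,
    "data/file_index": 2,
    "dataset_from_index": 1400,  # assumed name for the row-span start
    "dataset_to_index": 1650,    # assumed name for the row-span end
    "videos/observation.images.front/chunk_index": 0,
    "videos/observation.images.front/file_index": 1,
    "videos/observation.images.front/from_timestamp": 46.5,
    "videos/observation.images.front/to_timestamp": 54.8,
}


def locate_v3_episode(row: dict, video_key: str) -> dict:
    """Resolve shared-file paths plus the row and time spans for one episode."""
    prefix = f"videos/{video_key}"
    return {
        "data_file": DATA_PATH.format(
            chunk_index=row["data/chunk_index"], file_index=row["data/file_index"]
        ),
        "row_span": (row["dataset_from_index"], row["dataset_to_index"]),
        "video_file": VIDEO_PATH.format(
            video_key=video_key,
            chunk_index=row[f"{prefix}/chunk_index"],
            file_index=row[f"{prefix}/file_index"],
        ),
        "time_span": (row[f"{prefix}/from_timestamp"], row[f"{prefix}/to_timestamp"]),
    }


location = locate_v3_episode(episode_row, "observation.images.front")
print(location["data_file"])   # data/chunk-000/file-002.parquet
print(location["video_file"])  # videos/observation.images.front/chunk-000/file-001.mp4
```

Note that the data and video coordinates are independent: an episode's frames and its video segment can sit in shards with different file indices.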
meta/info.json
For both generations, meta/info.json is the first file to inspect because it defines:
- codebase_version
- fps
- features
- splits
- path templates
The key field is codebase_version:
- v2.0 / v2.1 → legacy dataset layout
- v3.0 → modern dataset layout
Compared with v2, v3 leans harder on templates + metadata, so constraints on info.json matter more. Official v3 docs treat data_path / video_path patterns as part of the format contract. LeRobotDataset v3 docs
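The codebase_version check is easy to script. The sketch below uses only the standard library; the function names and the "legacy" / "modern" labels are illustrative choices, not part of any lerobot API.

```python
import json
from pathlib import Path


def dataset_generation(info: dict) -> str:
    """Map codebase_version from meta/info.json to a layout generation."""
    version = str(info.get("codebase_version", ""))
    if version.startswith("v2."):
        return "legacy"  # per-episode files (v2.0 / v2.1)
    if version.startswith("v3."):
        return "modern"  # shared files plus episode locators (v3.0)
    raise ValueError(f"Unrecognized codebase_version: {version!r}")


def read_generation(dataset_root: str) -> str:
    """Load meta/info.json from a dataset folder and classify it."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    return dataset_generation(json.loads(info_path.read_text()))


print(dataset_generation({"codebase_version": "v2.1"}))  # legacy
print(dataset_generation({"codebase_version": "v3.0"}))  # modern
```

Raising on an unknown value, instead of guessing, keeps a mislabeled or hand-edited info.json from silently selecting the wrong loader.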
Does lerobot v0.5.0 change the dataset format?
Split format generation from capabilities.
Format generation
- LeRobotDataset remains v3.0
- No new v3.1
- No new breaking layout generation
Capabilities in v0.5.0
Notable dataset-related updates include:
- Streaming video encoding — encode during capture to reduce gaps between episodes
- Faster training/encoding
- More dataset tooling
- Subtask support
- Image-to-video conversion
These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. LeRobot v0.5.0 · Streaming video encoding
Is streaming_encoding a format bump?
No. It is a recording-time performance option that shifts encoding from “batch after each episode” to “incremental during capture.” It changes encoding timing, CPU/GPU load, and capture latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. Streaming video encoding
Format conversion
Official v2.1 → v3.0
Hugging Face ships a migration utility that converts per-episode parquet/mp4 trees into the shared-file v3.0 layout and fills episode locators.
- Docs: Migrate v2.1 → v3.0
- Typical command:
python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset
Best suited to datasets already on the Hugging Face Hub.
Bidirectional conversion with LeRobot Studio (v2.1 ↔ v3.0)
LeRobot Studio fits local or private data:
- Open .tar.gz or an extracted folder
- Inspect playback and health checks
- Export with a chosen target version
Common flows:
- v2.1 → v3.0: merge per-episode shards into shared files
- v3.0 → v2.1: split shared files back into per-episode bundles
When you need both visualization and conversion, LeRobot Studio is usually the fastest path.
Validation checklist
Suggested order:
- Open meta/info.json and confirm codebase_version
- Verify the directory tree matches that generation
- Finally confirm tasks, episodes, and video files are complete
v2.1 minimum artifacts
- meta/info.json
- meta/episodes.jsonl
- meta/tasks.jsonl
- data/chunk-*/episode_*.parquet
- videos/chunk-*/.../episode_*.mp4
v3.0 minimum artifacts
- meta/info.json
- At least one Parquet shard under meta/episodes/
- data/chunk-*/file-*.parquet
- videos/.../chunk-*/file-*.mp4
- meta/tasks.jsonl
If a tool emits meta/tasks.parquet, treat it as supplemental—do not infer the format generation from that file alone.
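The two minimum-artifact lists can be turned into a quick presence check. This is a sketch under stated assumptions: the glob patterns approximate the lists above (a single `*` stands in for the camera-key directory), the function name is hypothetical, and it only tests that files exist, not that their contents are valid. Image-only datasets without videos would need the video pattern removed.

```python
from pathlib import Path

# Glob patterns approximating the minimum-artifact lists for each generation.
REQUIRED_GLOBS = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "meta/tasks.jsonl",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
    ],
}


def missing_artifacts(dataset_root, version: str) -> list[str]:
    """Return the required patterns that match no file under dataset_root."""
    root = Path(dataset_root)
    return [
        pattern
        for pattern in REQUIRED_GLOBS[version]
        if not any(root.glob(pattern))
    ]


# Usage (hypothetical path): an empty list means every minimum artifact is present.
# print(missing_artifacts("path/to/dataset", "v3.0"))
```

meta/tasks.parquet is deliberately absent from both lists, so its presence or absence never affects the result.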
Practical guidance
Decouple dataset format from training stack support:
- First: which dataset generation do you have (v2.1 vs v3.0)?
- Second: does your target model/script/release support that generation?
Typical cases:
- Pi0 / OpenPI: often still expects v2.1
- Modern LeRobot training + dataset tooling: biased toward v3.0
Pick v2.1 vs v3.0 based on compatibility with your training pipeline, not only on the headline lerobot software version.