LeRobotDataset v2 vs v3

This page discusses LeRobotDataset format versions, not lerobot software releases. Keep the distinction explicit:

| Category | Example | Meaning |
|---|---|---|
| LeRobot software | v0.4.0, v0.5.0 | Hugging Face lerobot package releases |
| LeRobotDataset format | v2.1, v3.0 | Dataset directory layout, metadata organization, and loader semantics |

As of the official lerobot v0.5.0 announcement, the current dataset format generation is still LeRobotDataset v3.0. There is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing, not on the on-disk format. (See: LeRobot v0.4.0 · LeRobotDataset v3.0)

For format conversion or visual validation locally, use LeRobot Studio: open a dataset and pick the target version (v2.1 or v3.0) when exporting.

Choosing a format

  • Pick v2.1 when your training stack still expects the legacy per-episode file layout, or when you want “one episode, one bundle of files” for manual inspection. For example, OpenPI Pi0/Pi0.5 training currently expects v2.1.
  • Pick v3.0 for large-scale datasets, fewer files on disk, or Hub streaming workflows.

Comparison at a glance

| Dimension | v2.0 / v2.1 | v3.0 |
|---|---|---|
| Storage | One file group per episode | Many episodes packed into a few large files, located via metadata |
| Tabular | data/chunk-XXX/episode_YYYYYY.parquet | data/chunk-XXX/file-YYY.parquet |
| Video | videos/chunk-XXX/{key}/episode_YYYYYY.mp4 | videos/{key}/chunk-XXX/file-YYY.mp4 |
| Episode metadata | meta/episodes.jsonl | Sharded Parquet under meta/episodes/ |
| Task metadata | meta/tasks.jsonl | Official docs center on meta/tasks.jsonl; some tools may also emit meta/tasks.parquet |
| Path resolution | Derived from episode_index + chunks_size | Path templates in meta/info.json plus per-episode locators in metadata |
| Large-scale | File count grows quickly | Better suited to large corpora and Hub streaming |

The v3 idea: write shared files, then reconstruct episode-level views from metadata.
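
The shared-file idea can be illustrated with a minimal sketch in plain Python. It is not the lerobot implementation: the function names and the `from_index` / `to_index` keys are hypothetical, chosen only to show how a per-episode row span lets a reader recover one episode from a table that holds many.

```python
from itertools import accumulate


def build_episode_spans(episode_lengths):
    """Return one locator record per episode.

    Spans are half-open: an episode owns rows [from_index, to_index)
    of the shared table. Key names are illustrative, not the real schema.
    """
    ends = list(accumulate(episode_lengths))
    starts = [0] + ends[:-1]
    return [
        {"episode_index": i, "from_index": start, "to_index": end}
        for i, (start, end) in enumerate(zip(starts, ends))
    ]


def episode_view(shared_rows, span):
    """Slice one episode back out of the shared rows."""
    return shared_rows[span["from_index"]:span["to_index"]]


# Three episodes of 4, 2, and 3 frames packed into one shared table of 9 rows.
shared_rows = list(range(9))
spans = build_episode_spans([4, 2, 3])

print(episode_view(shared_rows, spans[1]))  # [4, 5]
```

The writer only appends to the shared table and records spans; readers never need one file per episode.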

Directory layouts

v2.1

v2.1 keeps one episode per group of files—easy to browse manually.

meta/
  info.json
  episodes.jsonl
  tasks.jsonl
data/
  chunk-000/
    episode_000000.parquet
    episode_000001.parquet
    ...
videos/
  chunk-000/
    observation.images.front/
      episode_000000.mp4
      episode_000001.mp4
      ...

Convenient for inspection, but filesystem pressure grows with episode count.

v3.0

v3.0 packs multiple episodes into shared Parquet/MP4 shards and records where each episode lives inside those files.

meta/
  info.json
  stats.json
  tasks.jsonl
  episodes/
    chunk-000/
      file-000.parquet
  # some toolchains may also emit tasks.parquet
data/
  chunk-000/
    file-000.parquet
videos/
  observation.images.front/
    chunk-000/
      file-000.mp4

Per public docs, meta/tasks.jsonl remains the canonical task metadata for v3; tasks.parquet may appear as a compatibility or supplemental artifact. Treat tasks.jsonl as the primary spec and tasks.parquet as optional—not the sole definition of v3.

Metadata differences

Episode metadata

v2 stores one JSON line per episode in meta/episodes.jsonl, typically including:

  • episode_index
  • length
  • tasks / task_index

Paths are usually not stored per row; loaders derive them from episode_index and chunks_size.
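
As a rough sketch of that derivation, the snippet below rebuilds v2-style paths from `episode_index` and `chunks_size`, using the layout shown in the comparison table. The function name, the example `chunks_size` of 1000, and the zero-padding widths (three digits for chunks, six for episodes, read off the `chunk-XXX` / `episode_YYYYYY` placeholders) are assumptions for illustration; a real loader should take the templates from meta/info.json.

```python
def v2_episode_paths(episode_index, chunks_size, video_keys):
    """Derive v2-style file paths for one episode (illustrative helper)."""
    # Episodes are grouped into chunks of `chunks_size` consecutive indices.
    chunk = episode_index // chunks_size
    data_path = f"data/chunk-{chunk:03d}/episode_{episode_index:06d}.parquet"
    video_paths = {
        key: f"videos/chunk-{chunk:03d}/{key}/episode_{episode_index:06d}.mp4"
        for key in video_keys
    }
    return data_path, video_paths


data_path, video_paths = v2_episode_paths(
    episode_index=1234,
    chunks_size=1000,  # assumed value for the example
    video_keys=["observation.images.front"],
)
print(data_path)
# data/chunk-001/episode_001234.parquet
print(video_paths["observation.images.front"])
# videos/chunk-001/observation.images.front/episode_001234.mp4
```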

v3 stores episode rows in sharded Parquet under meta/episodes/, including:

  • Global row span inside the shared parquet shard
  • Data file coordinates (data/chunk_index, data/file_index, …)
  • Video file coordinates and time spans (from_timestamp, to_timestamp, …)

That is how v3 preserves episode-level access despite shared files.
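
A minimal sketch of how such a row could be turned into file locations follows. The row is a plain dict standing in for one record read from meta/episodes/. The `data/chunk_index`, `data/file_index`, `from_timestamp`, and `to_timestamp` names come from the list above; prefixing the video columns with `videos/{key}/` and the three-digit padding are assumptions, so check the actual column names in your dataset before relying on them.

```python
def resolve_v3_episode(row, video_key):
    """Map one episode metadata row to its shared files (illustrative helper)."""
    data_path = (
        f"data/chunk-{row['data/chunk_index']:03d}"
        f"/file-{row['data/file_index']:03d}.parquet"
    )
    # Assumed naming: per-camera columns are prefixed with "videos/<key>/".
    prefix = f"videos/{video_key}"
    video_chunk = row[f"{prefix}/chunk_index"]
    video_file = row[f"{prefix}/file_index"]
    video_path = f"{prefix}/chunk-{video_chunk:03d}/file-{video_file:03d}.mp4"
    # The episode occupies only this time window of the shared MP4.
    clip = (row[f"{prefix}/from_timestamp"], row[f"{prefix}/to_timestamp"])
    return {"data_path": data_path, "video_path": video_path, "clip": clip}


row = {
    "episode_index": 7,
    "data/chunk_index": 0,
    "data/file_index": 2,
    "videos/observation.images.front/chunk_index": 0,
    "videos/observation.images.front/file_index": 1,
    "videos/observation.images.front/from_timestamp": 12.5,
    "videos/observation.images.front/to_timestamp": 20.0,
}
loc = resolve_v3_episode(row, "observation.images.front")
print(loc["data_path"])   # data/chunk-000/file-002.parquet
print(loc["video_path"])  # videos/observation.images.front/chunk-000/file-001.mp4
print(loc["clip"])        # (12.5, 20.0)
```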

meta/info.json

For both generations, meta/info.json is the first file to inspect because it defines:

  • codebase_version
  • fps
  • features
  • splits
  • path templates

The key field is codebase_version:

  • v2.0 / v2.1 → legacy dataset layout
  • v3.0 → modern dataset layout

v3 relies more heavily on path templates and metadata than v2 does, so the contents of info.json matter more. The official v3 docs treat the data_path and video_path patterns as part of the format contract. (See: LeRobotDataset v3 docs)
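
A quick way to apply this is to read `codebase_version` before touching anything else. The sketch below uses only the standard library; the helper name and the "legacy" / "modern" labels mirror the mapping above and are not part of any lerobot API. It writes a tiny stand-in info.json to a temporary directory so it can run anywhere.

```python
import json
import tempfile
from pathlib import Path

# Mapping taken from the list above; anything else is reported as "unknown".
LAYOUT_BY_VERSION = {"v2.0": "legacy", "v2.1": "legacy", "v3.0": "modern"}


def dataset_layout(dataset_root):
    """Return (codebase_version, layout label) for a dataset directory."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    info = json.loads(info_path.read_text(encoding="utf-8"))
    version = info.get("codebase_version")
    return version, LAYOUT_BY_VERSION.get(version, "unknown")


# Demo against a stand-in dataset containing only meta/info.json.
with tempfile.TemporaryDirectory() as tmp:
    meta_dir = Path(tmp) / "meta"
    meta_dir.mkdir()
    (meta_dir / "info.json").write_text(
        json.dumps({"codebase_version": "v3.0", "fps": 30}), encoding="utf-8"
    )
    result = dataset_layout(tmp)

print(result)  # ('v3.0', 'modern')
```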

Does lerobot v0.5.0 change the dataset format?

Treat the format generation and the release's capabilities as two separate questions.

Format generation

  • LeRobotDataset remains v3.0
  • No new v3.1
  • No new breaking layout generation

Capabilities in v0.5.0

Notable dataset-related updates include:

  • Streaming video encoding — encode during capture to reduce gaps between episodes
  • Faster training/encoding
  • More dataset tooling
  • Subtask support
  • Image-to-video conversion

These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. (See: LeRobot v0.5.0 · Streaming video encoding)

Is streaming_encoding a format bump?

No. It is a recording-time performance option that shifts encoding from "batch after each episode" to "incremental during capture." It changes encoding timing, CPU/GPU load, and capture latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. (See: Streaming video encoding)

Format conversion

Official v2.1 → v3.0

Hugging Face ships a migration utility that converts per-episode parquet/mp4 trees into the shared-file v3.0 layout and fills episode locators.

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset

Best suited to datasets already on the Hugging Face Hub.

Bidirectional conversion with LeRobot Studio (v2.1 ↔ v3.0)

LeRobot Studio fits local or private data:

  • Open .tar.gz or an extracted folder
  • Inspect playback and health checks
  • Export with a chosen target version

Common flows:

  • v2.1 → v3.0: merge per-episode files into shared files
  • v3.0 → v2.1: split shared files back into per-episode bundles

Recommendation

When you need both visualization and conversion, LeRobot Studio is usually the fastest path.

Validation checklist

Suggested order:

  1. Open meta/info.json and confirm codebase_version
  2. Verify the directory tree matches that generation
  3. Finally confirm tasks, episodes, and video files are complete

v2.1 minimum artifacts

  • meta/info.json
  • meta/episodes.jsonl
  • meta/tasks.jsonl
  • data/chunk-*/episode_*.parquet
  • videos/chunk-*/.../episode_*.mp4

v3.0 minimum artifacts

  • meta/info.json
  • At least one Parquet shard under meta/episodes/
  • data/chunk-*/file-*.parquet
  • videos/.../chunk-*/file-*.mp4
  • meta/tasks.jsonl

If a tool emits meta/tasks.parquet, treat it as supplemental—do not infer the format generation from that file alone.
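
The two artifact lists can be checked mechanically. The following is a minimal sketch, not an official validator: the glob patterns are transcribed from the lists above, with the elided video subdirectory filled in as a single `*` level based on the layouts in the comparison table (an assumption). It only tests that files exist, not that their contents are valid.

```python
import tempfile
from pathlib import Path

REQUIRED_PATTERNS = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
        "meta/tasks.jsonl",
    ],
}


def missing_artifacts(dataset_root, version):
    """Return the required patterns that match no file under dataset_root."""
    root = Path(dataset_root)
    return [
        pattern
        for pattern in REQUIRED_PATTERNS[version]
        if next(root.glob(pattern), None) is None
    ]


# Demo: an empty-file v3.0 tree that deliberately lacks meta/tasks.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    for rel in [
        "meta/info.json",
        "meta/episodes/chunk-000/file-000.parquet",
        "data/chunk-000/file-000.parquet",
        "videos/observation.images.front/chunk-000/file-000.mp4",
    ]:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    missing_v3 = missing_artifacts(tmp, "v3.0")
    missing_v2 = missing_artifacts(tmp, "v2.1")

print(missing_v3)  # ['meta/tasks.jsonl']
```

Running the same tree against the v2.1 list reports every v2.1 artifact except meta/info.json as missing, which is a quick signal that the directory does not match that generation.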

Practical guidance

Decouple dataset format from training stack support:

  • First: which dataset generation do you have (v2.1 vs v3.0)?
  • Second: does your target model/script/release support that generation?

Typical cases:

  • Pi0 / OpenPI — often still expects v2.1
  • Modern LeRobot training + dataset tooling — biased toward v3.0

Pick v2.1 vs v3.0 based on compatibility with your training pipeline, not only on the headline lerobot software version.

Further reading