LeRobotDataset v2 vs v3

This page discusses LeRobotDataset format versions, not lerobot software releases. Keep the distinction explicit:

Category	Example	Meaning
LeRobot software	`v0.4.0`, `v0.5.0`	Hugging Face `lerobot` package releases
LeRobotDataset format	`v2.1`, `v3.0`	Dataset directory layout, metadata organization, and loader semantics

As of the official lerobot v0.5.0 announcement, the dataset format generation is still LeRobotDataset v3.0. There is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing, not on the format itself. LeRobot v0.4.0 · LeRobotDataset v3.0

For format conversion or visual validation locally, use LeRobot Studio: open a dataset and pick the target version (v2.1 or v3.0) when exporting.

Choosing a format

Pick v2.1 when your training stack still expects the legacy per-episode file layout, or when you want “one episode, one bundle of files” for manual inspection. For example, OpenPI Pi0/Pi0.5 training currently expects v2.1.
Pick v3.0 for large-scale datasets, fewer files on disk, or Hub streaming workflows.

Comparison at a glance

Dimension	`v2.0 / v2.1`	`v3.0`
Storage	One file group per episode	Many episodes packed into a few large files, located via metadata
Tabular	`data/chunk-XXX/episode_YYYYYY.parquet`	`data/chunk-XXX/file-YYY.parquet`
Video	`videos/chunk-XXX/{key}/episode_YYYYYY.mp4`	`videos/{key}/chunk-XXX/file-YYY.mp4`
Episode metadata	`meta/episodes.jsonl`	Sharded Parquet under `meta/episodes/`
Task metadata	`meta/tasks.jsonl`	Official docs center on `meta/tasks.jsonl`; some tools may also emit `meta/tasks.parquet`
Path resolution	Derived from `episode_index` + `chunks_size`	Path templates in `meta/info.json` plus per-episode locators in metadata
Large-scale	File count grows quickly	Better suited to large corpora and Hub streaming

The v3 idea: write shared files, then reconstruct episode-level views from metadata.

Directory layouts

`v2.1`

v2.1 keeps one episode per group of files—easy to browse manually.

meta/
  info.json
  episodes.jsonl
  tasks.jsonl
data/
  chunk-000/
    episode_000000.parquet
    episode_000001.parquet
    ...
videos/
  chunk-000/
    observation.images.front/
      episode_000000.mp4
      episode_000001.mp4
      ...

Convenient for inspection, but filesystem pressure grows with episode count.

`v3.0`

v3.0 packs multiple episodes into shared Parquet/MP4 shards and records where each episode lives inside those files.

meta/
  info.json
  stats.json
  tasks.jsonl
  episodes/
    chunk-000/
      file-000.parquet
  # some toolchains may also emit tasks.parquet
data/
  chunk-000/
    file-000.parquet
videos/
  observation.images.front/
    chunk-000/
      file-000.mp4

Per public docs, meta/tasks.jsonl remains the canonical task metadata for v3; tasks.parquet may appear as a compatibility or supplemental artifact. Treat tasks.jsonl as the primary spec and tasks.parquet as optional—not the sole definition of v3.

Metadata differences

Episode metadata

v2 stores one JSON line per episode in meta/episodes.jsonl, typically including:

episode_index
length
tasks / task_index

Paths are usually not stored per row; loaders derive them from episode_index and chunks_size.
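As a rough illustration of that derivation, the sketch below rebuilds v2-style paths from an episode index. The helper name `v2_episode_paths` is hypothetical, and the hard-coded patterns simply mirror the layout shown on this page; a real loader would normally read the path templates from meta/info.json rather than embed them.

```python
def v2_episode_paths(episode_index: int, chunks_size: int, video_key: str) -> dict:
    """Derive v2-style file paths for one episode (illustrative helper, not a lerobot API)."""
    # Assumption: episodes are grouped into chunk directories holding
    # `chunks_size` episodes each, so the chunk is an integer division.
    episode_chunk = episode_index // chunks_size
    return {
        "data": f"data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
        "video": f"videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    }


paths = v2_episode_paths(1, chunks_size=1000, video_key="observation.images.front")
print(paths["data"])
print(paths["video"])
```

Because the path is a pure function of the index, v2 needs no per-episode locator columns; the cost is one file group per episode.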

v3 stores episode rows in sharded Parquet under meta/episodes/, including:

Global row span inside the shared parquet shard
Data file coordinates (data/chunk_index, data/file_index, …)
Video file coordinates and time spans (from_timestamp, to_timestamp, …)

That is how v3 preserves episode-level access despite shared files.
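A minimal sketch of that lookup, assuming one episode metadata row has already been read into a plain dict. The function name `locate_episode_v3`, the span column names `dataset_from_index` / `dataset_to_index`, and the template string are assumptions made for illustration; check the actual column names in meta/episodes/ and the templates in meta/info.json for your dataset.

```python
def locate_episode_v3(episode_row: dict, data_path_template: str) -> tuple:
    """Return (shared file path, span start, span end) for one episode (illustrative only)."""
    # File coordinates select which shared Parquet file holds the episode.
    path = data_path_template.format(
        chunk_index=episode_row["data/chunk_index"],
        file_index=episode_row["data/file_index"],
    )
    # Assumed span columns: where the episode's rows start and stop.
    return path, episode_row["dataset_from_index"], episode_row["dataset_to_index"]


# Hypothetical metadata row and template, for demonstration only.
row = {
    "episode_index": 7,
    "data/chunk_index": 0,
    "data/file_index": 0,
    "dataset_from_index": 120,
    "dataset_to_index": 310,
}
template = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
print(locate_episode_v3(row, template))
```

The point of the sketch is the indirection: the path no longer follows from the episode index alone, so the metadata row is required to find an episode.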

`meta/info.json`

For both generations, meta/info.json is the first file to inspect because it defines:

codebase_version
fps
features
splits
path templates

The key field is codebase_version:

v2.0 / v2.1 → legacy dataset layout
v3.0 → modern dataset layout
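This check is easy to script. The sketch below uses only the Python standard library; the function names are hypothetical, and the prefix matching (`v2.` / `v3.`) is an assumption based on the version strings listed above.

```python
import json
from pathlib import Path


def layout_generation(codebase_version: str) -> str:
    """Map a codebase_version string to 'legacy' or 'modern' (illustrative helper)."""
    version = codebase_version.strip().lower()
    if version.startswith("v2."):
        return "legacy"
    if version.startswith("v3."):
        return "modern"
    raise ValueError(f"Unrecognized codebase_version: {codebase_version!r}")


def read_generation(dataset_root) -> str:
    """Read meta/info.json under dataset_root and classify its layout."""
    info_path = Path(dataset_root) / "meta" / "info.json"
    info = json.loads(info_path.read_text(encoding="utf-8"))
    return layout_generation(info["codebase_version"])
```

Raising on an unknown version, rather than guessing, keeps a future format generation from being silently treated as v3.0.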

v3 relies more heavily on path templates and metadata than v2 does, so the contents of info.json matter more. Official v3 docs treat data_path / video_path patterns as part of the format contract. LeRobotDataset v3 docs

Does `lerobot v0.5.0` change the dataset format?

Separate the format generation from the capabilities a software release adds.

Format generation

LeRobotDataset remains v3.0
No new v3.1
No new layout-breaking format generation

Capabilities in `v0.5.0`

Notable dataset-related updates include:

Streaming video encoding — encode during capture to reduce gaps between episodes
Faster training/encoding
More dataset tooling
Subtask support
Image-to-video conversion

These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. LeRobot v0.5.0 · Streaming video encoding

Is `streaming_encoding` a format bump?

No. It is a recording-time performance option that shifts encoding from “batch after each episode” to “incremental during capture.” It changes encoding timing, CPU/GPU load, and capture latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. Streaming video encoding

Format conversion

Official `v2.1 → v3.0`

Hugging Face ships a migration utility that converts per-episode parquet/mp4 trees into the shared-file v3.0 layout and fills episode locators.

Docs: Migrate v2.1 → v3.0
Typical command:

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset

Best suited to datasets already on the Hugging Face Hub.
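To make "fills episode locators" concrete, here is a conceptual sketch of the bookkeeping involved when episodes are packed back to back. It is not the converter's implementation: the function name and the `from_index` / `to_index` keys are hypothetical, and it assumes every episode lands in a single shared file and that video frames are spaced exactly 1/fps apart.

```python
def build_locators(episode_lengths: list, fps: float) -> list:
    """Compute row spans and video time spans for episodes packed in order (conceptual only)."""
    locators = []
    row_cursor = 0  # index of the next free row in the shared file
    for episode_index, length in enumerate(episode_lengths):
        locators.append({
            "episode_index": episode_index,
            "from_index": row_cursor,                     # inclusive
            "to_index": row_cursor + length,              # exclusive
            "from_timestamp": row_cursor / fps,           # seconds into the shared video
            "to_timestamp": (row_cursor + length) / fps,
        })
        row_cursor += length
    return locators


for locator in build_locators([3, 2], fps=10):
    print(locator)
```

A real conversion also has to decide when to start a new shared file and must track each video key separately, which this sketch leaves out.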

Bidirectional conversion with LeRobot Studio (`v2.1` ↔ `v3.0`)

LeRobot Studio fits local or private data:

Open a .tar.gz archive or an extracted folder
Inspect playback and health checks
Export with a chosen target version

Common flows:

v2.1 → v3.0: merge per-episode files into shared files
v3.0 → v2.1: split shared files back into per-episode bundles

Recommendation

When you need both visualization and conversion, LeRobot Studio is usually the fastest path.

Validation checklist

Suggested order:

Open meta/info.json and confirm codebase_version
Verify the directory tree matches that generation
Finally confirm tasks, episodes, and video files are complete

`v2.1` minimum artifacts

meta/info.json
meta/episodes.jsonl
meta/tasks.jsonl
data/chunk-*/episode_*.parquet
videos/chunk-*/.../episode_*.mp4

`v3.0` minimum artifacts

meta/info.json
At least one Parquet shard under meta/episodes/
data/chunk-*/file-*.parquet
videos/.../chunk-*/file-*.mp4
meta/tasks.jsonl

If a tool emits meta/tasks.parquet, treat it as supplemental—do not infer the format generation from that file alone.
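The two artifact lists can be checked mechanically. The sketch below uses `pathlib` globbing; the helper name and the `REQUIRED_GLOBS` table are hypothetical, and the glob patterns are assumptions derived from the layouts on this page (for instance, a single directory level for the camera key). It only checks that files exist, not that their contents are valid, and a dataset with no video streams would be reported as missing videos.

```python
from pathlib import Path

# Assumed glob patterns mirroring the minimum artifact lists above.
REQUIRED_GLOBS = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
        "meta/tasks.jsonl",
    ],
}


def missing_artifacts(dataset_root, generation: str) -> list:
    """Return the required patterns that match no file under dataset_root."""
    root = Path(dataset_root)
    return [
        pattern
        for pattern in REQUIRED_GLOBS[generation]
        if not any(root.glob(pattern))
    ]
```

An empty list means every minimum artifact was found. Note that meta/tasks.parquet is deliberately absent from the table, in line with treating it as supplemental.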

Practical guidance

Decouple dataset format from training stack support:

First: which dataset generation do you have (v2.1 vs v3.0)?
Second: does your target model/script/release support that generation?

Typical cases:

Pi0 / OpenPI — often still expects v2.1
Modern LeRobot training + dataset tooling — biased toward v3.0

Pick v2.1 vs v3.0 based on compatibility with your training pipeline, not only on the headline lerobot software version.
