LeRobot v2 vs v3 format differences

This page describes LeRobotDataset format versions, not lerobot software releases. The two must be kept distinct:

Category	Example versions	Meaning
LeRobot software version	`v0.4.0`, `v0.5.0`	Release version of the Hugging Face `lerobot` codebase
LeRobotDataset format version	`v2.1`, `v3.0`	Version of dataset layout, metadata organization, and loading semantics

As of the official lerobot v0.5.0 release, the primary dataset format remains LeRobotDataset v3.0. There is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing capabilities, not on a new format. References: LeRobot v0.4.0 · LeRobotDataset v3.0

For local format conversion or visual validation, use LeRobot Studio: open a dataset and choose the target version (v2.1 or v3.0) on export.

Choosing a format version

v2.1: Use when your training stack depends on the legacy layout, or when you need one file group per episode for manual inspection. For example, OpenPI training for Pi0 / Pi0.5 currently supports v2.1 only.
v3.0: Use for large-scale datasets, when you need fewer files, or when streaming from the Hugging Face Hub.

Summary of differences

Aspect	`v2.0 / v2.1`	`v3.0`
Storage	One file group per episode	Multiple episodes in fewer shared files; locations defined in metadata
Tabular data	`data/chunk-XXX/episode_YYYYYY.parquet`	`data/chunk-XXX/file-YYY.parquet`
Video	`videos/chunk-XXX/{key}/episode_YYYYYY.mp4`	`videos/{key}/chunk-XXX/file-YYY.mp4`
Episode metadata	`meta/episodes.jsonl`	Chunked Parquet under `meta/episodes/`
Task metadata	`meta/tasks.jsonl`	Official docs center on `meta/tasks.jsonl`; some toolchains also accept or emit `meta/tasks.parquet`
Path resolution	Derived from `episode_index` and `chunks_size`	Path templates in `meta/info.json` plus per-episode location fields in metadata
Large-scale use	File count grows quickly	Better fit for large datasets and Hub streaming

v3 design: write shared files at ingest time; reconstruct per-episode views at read time using metadata.

Directory layout

`v2.1`

v2.1 uses one file group per episode; the tree is straightforward to inspect.

meta/
  info.json
  episodes.jsonl
  tasks.jsonl
data/
  chunk-000/
    episode_000000.parquet
    episode_000001.parquet
    ...
videos/
  chunk-000/
    observation.images.front/
      episode_000000.mp4
      episode_000001.mp4
      ...

This layout is easy to audit manually; with a very large episode count, filesystem pressure increases.
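To make the file-count pressure concrete, here is a rough back-of-the-envelope sketch. It counts only data and video files (not metadata), and `episodes_per_file` is a hypothetical packing factor introduced for illustration; how many episodes actually share a v3 file depends on the dataset and its settings.

```python
import math


def v2_file_count(num_episodes: int, num_cameras: int) -> int:
    """One parquet file plus one mp4 per camera for every episode."""
    return num_episodes * (1 + num_cameras)


def v3_file_count(num_episodes: int, num_cameras: int, episodes_per_file: int) -> int:
    """Shared files: assumes a fixed (hypothetical) number of episodes per file."""
    if episodes_per_file <= 0:
        raise ValueError("episodes_per_file must be positive")
    shards = math.ceil(num_episodes / episodes_per_file)
    return shards * (1 + num_cameras)


if __name__ == "__main__":
    # 100,000 episodes recorded with two cameras
    print(v2_file_count(100_000, 2))
    print(v3_file_count(100_000, 2, episodes_per_file=500))
```

Under these assumptions the per-episode layout grows linearly with the episode count, while the shared layout grows with the number of shards.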

`v3.0`

v3.0 merges many episodes into shared Parquet/MP4 shards and records each episode’s span inside those files in metadata.

meta/
  info.json
  stats.json
  tasks.jsonl
  episodes/
    chunk-000/
      file-000.parquet
  # Some toolchains may also emit tasks.parquet
data/
  chunk-000/
    file-000.parquet
videos/
  observation.images.front/
    chunk-000/
      file-000.mp4

Per public documentation, meta/tasks.jsonl remains the primary task file for v3; validators or exporters may also support meta/tasks.parquet. Treat tasks.jsonl as the canonical reference and tasks.parquet as optional compatibility, not the sole required file for v3.

Metadata differences

Episode metadata

In v2, meta/episodes.jsonl holds one JSON object per line, one line per episode, typically including:

episode_index
length
tasks / task_index

Paths are usually not stored per episode; they are inferred from episode_index and chunks_size.
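The inference described above can be sketched as follows. The templates mirror the v2 paths shown in the summary table on this page; `v2_episode_paths` is an illustrative helper, not a lerobot API, and a real dataset should take its templates from meta/info.json.

```python
# Templates modeled on the v2.1 layout shown on this page (assumption:
# a real dataset declares its own templates in meta/info.json).
DATA_TEMPLATE = "data/chunk-{chunk:03d}/episode_{episode:06d}.parquet"
VIDEO_TEMPLATE = "videos/chunk-{chunk:03d}/{key}/episode_{episode:06d}.mp4"


def v2_episode_paths(episode_index: int, chunks_size: int, video_keys: list[str]) -> dict[str, str]:
    """Derive v2-style file paths for one episode (illustrative helper)."""
    if chunks_size <= 0:
        raise ValueError("chunks_size must be positive")
    # Episodes are grouped into fixed-size chunks, so the chunk number
    # is the integer quotient of the episode index.
    chunk = episode_index // chunks_size
    paths = {"data": DATA_TEMPLATE.format(chunk=chunk, episode=episode_index)}
    for key in video_keys:
        paths[key] = VIDEO_TEMPLATE.format(chunk=chunk, key=key, episode=episode_index)
    return paths


if __name__ == "__main__":
    for name, path in v2_episode_paths(1001, 1000, ["observation.images.front"]).items():
        print(name, path)
```

Because every path is a pure function of `episode_index` and `chunks_size`, v2 needs no per-episode location fields.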

In v3, episode rows live in chunked Parquet under meta/episodes/. Beyond length and task fields, they record:

Global row span of the episode inside shared parquet
Data shard location, e.g. data/chunk_index, data/file_index
Video shard location and time span, e.g. from_timestamp, to_timestamp

That is how v3 keeps episode-level access while using shared files.
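A minimal sketch of that lookup, using a plain dict in place of one row of episode metadata. The `data/chunk_index` and `data/file_index` keys come from the examples above; the per-camera key names (`videos/<key>/chunk_index` and so on) and the helper `locate_v3_episode` are assumptions made for illustration, so check them against the actual columns of your dataset.

```python
# Templates modeled on the v3.0 layout shown on this page (assumption:
# a real dataset declares its own templates in meta/info.json).
DATA_TEMPLATE = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
VIDEO_TEMPLATE = "videos/{key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def locate_v3_episode(row: dict, video_key: str, fps: float) -> dict:
    """Resolve the shared files and the video time span for one episode row."""
    data_path = DATA_TEMPLATE.format(
        chunk_index=row["data/chunk_index"],
        file_index=row["data/file_index"],
    )
    prefix = f"videos/{video_key}"  # assumed per-camera key prefix
    video_path = VIDEO_TEMPLATE.format(
        key=video_key,
        chunk_index=row[f"{prefix}/chunk_index"],
        file_index=row[f"{prefix}/file_index"],
    )
    start = row[f"{prefix}/from_timestamp"]
    end = row[f"{prefix}/to_timestamp"]
    return {
        "data_path": data_path,
        "video_path": video_path,
        "video_span_s": (start, end),
        # Approximate frame count covered by the span at the dataset fps.
        "approx_frames": round((end - start) * fps),
    }


if __name__ == "__main__":
    key = "observation.images.front"
    row = {
        "data/chunk_index": 0,
        "data/file_index": 2,
        f"videos/{key}/chunk_index": 0,
        f"videos/{key}/file_index": 1,
        f"videos/{key}/from_timestamp": 12.5,
        f"videos/{key}/to_timestamp": 20.0,
    }
    print(locate_v3_episode(row, key, fps=30))
```

The point of the design is that the file name no longer identifies the episode; only the metadata row does.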

`meta/info.json`

For both v2 and v3, meta/info.json is the first file to validate because it defines:

codebase_version
fps
features
splits
Path templates

The key field is codebase_version:

v2.0 / v2.1 → legacy dataset layout
v3.0 → new dataset layout

Compared with v2, v3 relies more heavily on path templates and metadata-backed addressing, so the constraints on info.json are stricter. The official v3 docs treat data_path and video_path as part of the format definition. Reference: Official v3 documentation
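As a first validation step, the codebase_version check can be sketched as below. `classify_format` and its return labels are hypothetical names chosen for this example; only the `codebase_version` key and the version strings come from this page.

```python
import json


def classify_format(info_json_text: str) -> str:
    """Map the codebase_version in meta/info.json to a layout family."""
    info = json.loads(info_json_text)
    version = str(info.get("codebase_version", "")).strip()
    if version in {"v2.0", "v2.1"}:
        return "legacy"   # per-episode files
    if version == "v3.0":
        return "v3"       # shared files plus episode metadata
    return "unknown"      # missing or unrecognized version string


if __name__ == "__main__":
    print(classify_format('{"codebase_version": "v3.0", "fps": 30}'))
```

Returning "unknown" rather than guessing keeps later checks from validating a dataset against the wrong layout.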

Does `v0.5.0` change the dataset format?

The answer has two parts: format and capabilities.

Format

LeRobotDataset remains v3.0
No new v3.1
No new breaking layout generation

Capabilities

Notable dataset-related items in lerobot v0.5.0 include:

Streaming video encoding (encode while recording; less idle time between episodes)
Faster training and encoding
More dataset tools
Subtask support
Image-to-video conversion

These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. References: LeRobot v0.5.0 · Streaming video encoding

Is `streaming_encoding` a format change?

No. It is a recording-time performance option that shifts encoding from “batch after each episode” to “incremental during capture”. It affects encoding timing, CPU/GPU load, and recording latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. Reference: Streaming video encoding

Format conversion

Official `v2.1` → `v3.0`

Hugging Face ships a migration utility that turns per-episode parquet/mp4 into v3.0 shared shards and writes the required episode indexing metadata.

Docs: Migrate v2.1 → v3.0
Example:

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset

Best suited to datasets already on the Hugging Face Hub.

Bidirectional conversion with LeRobot Studio (`v2.1` ↔ `v3.0`)

LeRobot Studio fits local or private data:

Open .tar.gz or an extracted folder
Review playback and health checks
Export with a chosen target version

Typical flows:

v2.1 → v3.0: merge per-episode parquet/mp4 into shared shards
v3.0 → v2.1: split shared shards back to per-episode files

Recommendation

For combined visual inspection and format conversion, LeRobot Studio is usually the most direct option.

Validation checklist

Recommended order:

Open meta/info.json and confirm codebase_version
Verify the directory tree matches that version
Confirm tasks, episodes, and media files are present and consistent

Minimum for `v2.1`

meta/info.json
meta/episodes.jsonl
meta/tasks.jsonl
data/chunk-*/episode_*.parquet
videos/chunk-*/.../episode_*.mp4

Minimum for `v3.0`

meta/info.json
At least one Parquet file under meta/episodes/
data/chunk-*/file-*.parquet
videos/.../chunk-*/file-*.mp4
meta/tasks.jsonl

If a tool also writes meta/tasks.parquet, treat it as supplementary; do not infer a different format version from it alone.
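The minimum-file lists above can be turned into a quick local check. This is a sketch under stated assumptions: `REQUIRED_GLOBS` and `missing_entries` are hypothetical names, and the video patterns assume exactly one directory level for the camera key, as in the directory trees shown earlier.

```python
from pathlib import Path

# Glob patterns transcribed from the minimum lists on this page.
REQUIRED_GLOBS = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
        "meta/tasks.jsonl",
    ],
}


def missing_entries(root: str, version: str) -> list[str]:
    """Return the required patterns that match nothing under root."""
    base = Path(root)
    return [pattern for pattern in REQUIRED_GLOBS[version] if not any(base.glob(pattern))]


if __name__ == "__main__":
    import sys

    dataset_root = sys.argv[1] if len(sys.argv) > 1 else "."
    for version in REQUIRED_GLOBS:
        print(version, "missing:", missing_entries(dataset_root, version))
```

An empty list means only that the expected files exist; it does not verify that episode metadata and media are mutually consistent.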

Training selection notes

Decide separately:

Dataset format (v2.1 vs v3.0)
Framework support for that format (model scripts, pinned lerobot release)

Common cases:

Pi0 / OpenPI: often still requires v2.1
Current LeRobot training and dataset tooling: oriented toward v3.0

Therefore, choose v2.1 or v3.0 from data compatibility and training stack support, not from the lerobot release headline alone.
