LeRobot v2 vs v3 format differences

This page describes LeRobotDataset format versions, not lerobot software releases. The two must be kept distinct:

Category | Example versions | Meaning
LeRobot software version | v0.4.0, v0.5.0 | Release version of the Hugging Face lerobot codebase
LeRobotDataset format version | v2.1, v3.0 | Version of dataset layout, metadata organization, and loading semantics

As of the official lerobot v0.5.0 release, the primary dataset format remains LeRobotDataset v3.0; there is no v3.1 and no new format generation. Dataset-related changes in v0.5.0 focus on recording performance, tooling, and editing capabilities, not on the format itself. (Sources: LeRobot v0.4.0 · LeRobotDataset v3.0)

For local format conversion or visual validation, use LeRobot Studio: open a dataset and choose the target version (v2.1 or v3.0) on export.

Choosing a format version

  • v2.1: Use when your training stack depends on the legacy layout, or when you need one file group per episode for manual inspection. For example, OpenPI training for Pi0 / Pi0.5 currently supports v2.1 only.
  • v3.0: Use for large-scale datasets, when you need fewer files, or when streaming from the Hugging Face Hub.

Summary of differences

Aspect | v2.0 / v2.1 | v3.0
Storage | One file group per episode | Multiple episodes in fewer shared files; locations defined in metadata
Tabular data | data/chunk-XXX/episode_YYYYYY.parquet | data/chunk-XXX/file-YYY.parquet
Video | videos/chunk-XXX/{key}/episode_YYYYYY.mp4 | videos/{key}/chunk-XXX/file-YYY.mp4
Episode metadata | meta/episodes.jsonl | Chunked Parquet under meta/episodes/
Task metadata | meta/tasks.jsonl | Official docs center on meta/tasks.jsonl; some toolchains also accept or emit meta/tasks.parquet
Path resolution | Derived from episode_index and chunks_size | Path templates in meta/info.json plus per-episode location fields in metadata
Large-scale use | File count grows quickly | Better fit for large datasets and Hub streaming

v3 design: write shared files at ingest time; reconstruct per-episode views at read time using metadata.

Directory layout

v2.1

v2.1 uses one file group per episode; the tree is straightforward to inspect.

meta/
    info.json
    episodes.jsonl
    tasks.jsonl
data/
    chunk-000/
        episode_000000.parquet
        episode_000001.parquet
        ...
videos/
    chunk-000/
        observation.images.front/
            episode_000000.mp4
            episode_000001.mp4
            ...

This layout is easy to audit manually, but the file count grows with every episode, so very large datasets put pressure on the filesystem.
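To make the file-count growth concrete, here is a rough sketch that counts only the data and video files of a v2.1 tree, assuming every episode has exactly one Parquet file and one MP4 per camera. The function name and the example numbers are illustrative, not taken from any real dataset.

```python
def v21_file_count(num_episodes: int, num_cameras: int) -> int:
    """Data + video files in a v2.1 tree (metadata files not counted).

    Assumes one Parquet file and one MP4 per camera for every episode.
    """
    return num_episodes * (1 + num_cameras)


# Hypothetical dataset: 50,000 episodes recorded with 3 cameras.
print(v21_file_count(50_000, 3))  # 200000
```

In v3.0 the corresponding count depends on how many episodes land in each shared shard, so it is not a fixed multiple of the episode count.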

v3.0

v3.0 merges many episodes into shared Parquet/MP4 shards and records each episode’s span inside those files in metadata.

meta/
    info.json
    stats.json
    tasks.jsonl
    episodes/
        chunk-000/
            file-000.parquet
    # Some toolchains may also emit tasks.parquet
data/
    chunk-000/
        file-000.parquet
videos/
    observation.images.front/
        chunk-000/
            file-000.mp4

Per public documentation, meta/tasks.jsonl remains the primary task file for v3; validators or exporters may also support meta/tasks.parquet. Treat tasks.jsonl as the canonical reference and tasks.parquet as optional compatibility, not the sole required file for v3.

Metadata differences

Episode metadata

In v2, meta/episodes.jsonl stores one JSON object per line, one line per episode, typically including:

  • episode_index
  • length
  • tasks / task_index

Paths are usually not stored per episode; they are inferred from episode_index and chunks_size.
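The inference can be sketched as follows. The two templates below simply mirror the v2.1 layout shown earlier on this page; in a real dataset they should be read from meta/info.json rather than hard-coded, and the helper name v21_paths is hypothetical.

```python
# Templates mirroring the v2.1 layout; treat them as assumptions and prefer
# the path templates stored in the dataset's own meta/info.json.
V21_DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
V21_VIDEO_PATH = (
    "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4"
)


def v21_paths(episode_index: int, chunks_size: int, video_key: str) -> tuple[str, str]:
    """Derive the per-episode data and video paths from the episode index alone."""
    # Episodes are grouped into chunk directories of `chunks_size` episodes each.
    episode_chunk = episode_index // chunks_size
    data = V21_DATA_PATH.format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )
    video = V21_VIDEO_PATH.format(
        episode_chunk=episode_chunk,
        video_key=video_key,
        episode_index=episode_index,
    )
    return data, video


# Example with an assumed chunks_size of 1000.
print(v21_paths(1234, 1000, "observation.images.front"))
```

No per-episode lookup is needed: the episode index and chunks_size fully determine both paths.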

In v3, episode rows live in chunked Parquet under meta/episodes/. Beyond length and task fields, they record:

  • Global row span of the episode inside shared parquet
  • Data shard location, e.g. data/chunk_index, data/file_index
  • Video shard location and time span, e.g. from_timestamp, to_timestamp

That is how v3 keeps episode-level access while using shared files.
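A minimal sketch of that lookup, using a plain dict in place of a row read from meta/episodes/. The templates mirror the v3.0 layout shown above. The exact column names are an assumption here, in particular the per-camera prefix videos/{key}/, and the sample row is invented for illustration; check the keys against your dataset's actual episode metadata.

```python
# Templates mirroring the v3.0 layout; in practice, read them from meta/info.json.
V30_DATA_PATH = "data/chunk-{chunk_index:03d}/file-{file_index:03d}.parquet"
V30_VIDEO_PATH = "videos/{video_key}/chunk-{chunk_index:03d}/file-{file_index:03d}.mp4"


def v30_locate(episode: dict, video_key: str) -> dict:
    """Resolve one episode's shared files and its time span inside the video."""
    prefix = f"videos/{video_key}"  # assumed per-camera column prefix
    return {
        "data_file": V30_DATA_PATH.format(
            chunk_index=episode["data/chunk_index"],
            file_index=episode["data/file_index"],
        ),
        "video_file": V30_VIDEO_PATH.format(
            video_key=video_key,
            chunk_index=episode[f"{prefix}/chunk_index"],
            file_index=episode[f"{prefix}/file_index"],
        ),
        "video_span_s": (
            episode[f"{prefix}/from_timestamp"],
            episode[f"{prefix}/to_timestamp"],
        ),
    }


# Hypothetical episode row; the values are made up.
row = {
    "episode_index": 7,
    "data/chunk_index": 0,
    "data/file_index": 2,
    "videos/observation.images.front/chunk_index": 0,
    "videos/observation.images.front/file_index": 1,
    "videos/observation.images.front/from_timestamp": 12.5,
    "videos/observation.images.front/to_timestamp": 31.0,
}
print(v30_locate(row, "observation.images.front"))
```

Unlike v2, the episode index alone is not enough: the reader must consult the episode row to learn which shard holds the episode and where it starts and ends.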

meta/info.json

For both v2 and v3, meta/info.json is the first file to validate because it defines:

  • codebase_version
  • fps
  • features
  • splits
  • Path templates

The key field is codebase_version:

  • v2.0 / v2.1 → legacy dataset layout
  • v3.0 → new dataset layout
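As a first validation step, the mapping above can be checked in a few lines. This is a sketch: layout_generation is a hypothetical helper, and it deliberately rejects any version string not listed on this page instead of guessing.

```python
import json


def layout_generation(info: dict) -> str:
    """Map codebase_version from meta/info.json to a layout generation."""
    version = str(info.get("codebase_version", "")).strip()
    if version in ("v2.0", "v2.1"):
        return "legacy"
    if version == "v3.0":
        return "v3"
    raise ValueError(f"Unrecognized codebase_version: {version!r}")


# In practice: info = json.loads(Path("meta/info.json").read_text())
info = json.loads('{"codebase_version": "v3.0", "fps": 30}')
print(layout_generation(info))  # v3
```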

Compared with v2, v3 relies more on path templates and metadata-backed addressing, so constraints on info.json are stricter. The official v3 documentation treats data_path and video_path as part of the format definition.

Does v0.5.0 change the dataset format?

The answer has two parts: format and capabilities.

Format

  • LeRobotDataset remains v3.0
  • No new v3.1
  • No new breaking layout generation

Capabilities

Notable dataset-related items in lerobot v0.5.0 include:

  • Streaming video encoding (encode while recording; less idle time between episodes)
  • Faster training and encoding
  • More dataset tools
  • Subtask support
  • Image-to-video conversion

These affect recording, conversion, editing, and training workflows but do not advance the dataset format generation beyond v3.0. (Sources: LeRobot v0.5.0 · Streaming video encoding)

Is streaming_encoding a format change?

No. It is a recording-time performance option that shifts encoding from "batch after each episode" to "incremental during capture". It affects encoding timing, CPU/GPU load, and recording latency. It does not define a new LeRobotDataset version or replace the v3.0 layout. (Source: Streaming video encoding)

Format conversion

Official v2.1 → v3.0

Hugging Face ships a migration utility that turns per-episode parquet/mp4 into v3.0 shared shards and writes the required episode indexing metadata.

python -m lerobot.datasets.v30.convert_dataset_v21_to_v30 --repo-id=your-name/your-dataset

Best suited to datasets already on the Hugging Face Hub.

Bidirectional conversion with LeRobot Studio (v2.1 ↔ v3.0)

LeRobot Studio fits local or private data:

  • Open .tar.gz or an extracted folder
  • Review playback and health checks
  • Export with a chosen target version

Typical flows:

  • v2.1 → v3.0: merge per-episode parquet/mp4 into shared shards
  • v3.0 → v2.1: split shared shards back to per-episode files

Recommendation

For combined visual inspection and format conversion, LeRobot Studio is usually the most direct option.

Validation checklist

Recommended order:

  1. Open meta/info.json and confirm codebase_version
  2. Verify the directory tree matches that version
  3. Confirm tasks, episodes, and media files are present and consistent

Minimum for v2.1

  • meta/info.json
  • meta/episodes.jsonl
  • meta/tasks.jsonl
  • data/chunk-*/episode_*.parquet
  • videos/chunk-*/.../episode_*.mp4

Minimum for v3.0

  • meta/info.json
  • At least one Parquet file under meta/episodes/
  • data/chunk-*/file-*.parquet
  • videos/.../chunk-*/file-*.mp4
  • meta/tasks.jsonl

If a tool also writes meta/tasks.parquet, treat it as supplementary; do not infer a different format version from it alone.
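The two minimum lists can be turned into a quick presence check. The sketch below is illustrative only: the glob patterns assume one directory level for the video key, it checks presence rather than consistency, and an image-only dataset without videos would be reported as missing its video files. The names REQUIRED and missing_entries are hypothetical.

```python
import tempfile
from pathlib import Path

# Glob patterns derived from the minimum lists above (the "*" video-key
# level is an assumption about how the elided path segment is laid out).
REQUIRED = {
    "v2.1": [
        "meta/info.json",
        "meta/episodes.jsonl",
        "meta/tasks.jsonl",
        "data/chunk-*/episode_*.parquet",
        "videos/chunk-*/*/episode_*.mp4",
    ],
    "v3.0": [
        "meta/info.json",
        "meta/episodes/**/*.parquet",
        "data/chunk-*/file-*.parquet",
        "videos/*/chunk-*/file-*.mp4",
        "meta/tasks.jsonl",
    ],
}


def missing_entries(root, version: str) -> list[str]:
    """Return the required patterns that match nothing under `root`."""
    root = Path(root)
    return [p for p in REQUIRED[version] if next(root.glob(p), None) is None]


# Demo on a throwaway tree that contains only meta/info.json.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "meta").mkdir()
    (Path(tmp) / "meta" / "info.json").write_text("{}")
    print(missing_entries(tmp, "v3.0"))
```

An empty list means the minimum set is present. It does not confirm that episode metadata, tasks, and media agree with each other, which is step 3 of the checklist.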

Training selection notes

Decide separately:

  1. Dataset format (v2.1 vs v3.0)
  2. Framework support for that format (model scripts, pinned lerobot release)

Common cases:

  • Pi0 / OpenPI: often still requires v2.1
  • Current LeRobot training and dataset tooling: oriented toward v3.0

Therefore, choose v2.1 or v3.0 from data compatibility and training stack support, not from the lerobot release headline alone.

Further reading