Data Formats

EmbodiFlow is designed for general-purpose robot data management and uses the Robot Operating System (ROS) as the common standard for storing and handling robot data.

Data Import: Automatically converts non-ROS data from Zhiyuan, Songling, and other data collection systems into the ROS standard format for unified management.
Data Visualization: Built-in visualization models for 30+ mainstream robots, with smooth playback of 3D animations and 2D images in all formats.
Data Export: One-click export to the standard HDF5 and LeRobot data formats, with joint and image features adapted automatically to the original data, ready for direct model training.

Human Data Format
Teleoperation Robot Data Format
Export Model Training Data
- HDF5 Format
- LeRobot Format

Human Data Format

Human data collection is primarily used to record operator actions and interaction processes, containing multimodal sensor data.

File Structure

Each collection task generates a timestamp-named folder:

f"{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}"
├── align_result.csv    # Timestamp alignment table
├── annotation.json     # Annotation data
├── config/            # Camera and sensor configuration
│   ├── calib_data.yml
│   ├── depth_to_rgb.yml
│   ├── mocap_main.yml
│   ├── orbbec_depth.yml
│   ├── orbbec_rgb.yml
│   └── pose_calib.yml
└── data.mcap          # Multimodal data package
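
The naming pattern above can be split back into its fields. The following is a minimal sketch, assuming that no individual field contains an underscore; the `HumanSessionName` class and `parse_human_session_name` function are hypothetical helpers, not part of the platform. The sample name is the `belong_to` value from the annotation example in this document.

```python
from dataclasses import dataclass


@dataclass
class HumanSessionName:
    date: str
    project: str
    scene: str
    task: str
    staff_id: str
    timestamp: str


def parse_human_session_name(folder_name: str) -> HumanSessionName:
    """Split a folder name into its six underscore-separated fields."""
    parts = folder_name.split("_")
    if len(parts) != 6:
        # Assumption: no individual field contains an underscore.
        raise ValueError(f"expected 6 fields, got {len(parts)}: {folder_name!r}")
    return HumanSessionName(*parts)


name = parse_human_session_name(
    "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
)
print(name.scene, name.task, name.staff_id)
# PublicArea TableClearing szk
```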

Multimodal Data

The data.mcap file contains synchronized data from all sensors, stored in MCAP format.

Main Topic List:

Topic Name	Data Type	Description
`/mocap/sensor_data`	`io_msgs/squashed_mocap_data`	Motion capture joint velocity, acceleration, angular velocity, rotation angle, and sensor data
`/mocap/ros_tf`	`tf2_msgs/TFMessage`	TF transformations for all joints based on motion capture
`/joint_states`	`sensor_msgs/JointState`	JointState for all joints based on motion capture
`/rgbd/color/image_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from main head camera
`/rgbd/depth/image_raw`	`sensor_msgs/Image`	Depth image from main head camera
`/colorized_depth`	`sensor_msgs/CompressedImage`	Colorized depth image from main head camera
`/left_ee_pose`	`geometry_msgs/PoseStamped`	Left gripper pose in main head camera coordinate system
`/right_ee_pose`	`geometry_msgs/PoseStamped`	Right gripper pose in main head camera coordinate system
`/claws_l_hand`	`io_msgs/claws_angle`	Left gripper closure degree
`/claws_r_hand`	`io_msgs/claws_angle`	Right gripper closure degree
`/claws_touch_data`	`io_msgs/squashed_touch`	Gripper tactile data
`/realsense_left_hand/color/image_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from left gripper camera
`/realsense_left_hand/depth/image_rect_raw`	`sensor_msgs/Image`	Depth image from left gripper camera
`/realsense_right_hand/color/image_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from right gripper camera
`/realsense_right_hand/depth/image_rect_raw`	`sensor_msgs/Image`	Depth image from right gripper camera
`/usb_cam_fisheye/mjpeg_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from main head fisheye camera
`/usb_cam_left/mjpeg_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from main head left monocular camera
`/usb_cam_right/mjpeg_raw/compressed`	`sensor_msgs/CompressedImage`	RGB image from main head right monocular camera
`/ee_visualization`	`sensor_msgs/CompressedImage`	End-effector pose visualization in main head camera RGB image
`/touch_visualization`	`sensor_msgs/CompressedImage`	Gripper tactile data visualization
`/robot_description`	`std_msgs/String`	Motion capture URDF
`/global_localization`	`geometry_msgs/PoseStamped`	Main head camera pose in world coordinate system
`/world_left_ee_pose`	`geometry_msgs/PoseStamped`	Left gripper pose in world coordinate system
`/world_right_ee_pose`	`geometry_msgs/PoseStamped`	Right gripper pose in world coordinate system

Camera Data:

Main head RGBD camera: Color + depth images
Left/Right gripper cameras: RealSense RGBD
Fisheye camera: Panoramic view
Left/Right monocular cameras: Stereo vision

Note: If tactile gloves are used, the recording includes an additional `/mocap/touch_data` topic.

Example summary of a `data.mcap` file:

library:   mcap go v1.7.0                                              
profile:   ros1                                                        
messages:  45200                                                       
duration:  1m5.625866496s                                              
start:     2025-01-15T18:09:29.628202496+08:00 (1736935769.628202496)  
end:       2025-01-15T18:10:35.254068992+08:00 (1736935835.254068992)  
compression:
    zstd: [764/764 chunks] [6.13 GiB/3.84 GiB (37.39%)] [59.87 MiB/sec] 
channels:
    (1)  /rgbd/color/image_raw/compressed                  1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (2)  /joint_states                                     1970 msgs (30.02 Hz)   : sensor_msgs/JointState [ros1msg]       
    (3)  /claws_r_hand                                     1970 msgs (30.02 Hz)   : io_msgs/claws_angle [ros1msg]          
    (4)  /global_localization                              1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
    (5)  /robot_description                                   1 msgs              : std_msgs/String [ros1msg]              
    (6)  /ee_visualization                                 1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (7)  /rgbd/depth/image_raw                             1970 msgs (30.02 Hz)   : sensor_msgs/Image [ros1msg]            
    (8)  /colorized_depth                                  1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (9)  /claws_l_hand                                     1970 msgs (30.02 Hz)   : io_msgs/claws_angle [ros1msg]          
    (10) /claws_touch_data                                 1970 msgs (30.02 Hz)   : io_msgs/squashed_touch [ros1msg]       
    (11) /touch_visualization                              1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (12) /mocap/sensor_data                                1970 msgs (30.02 Hz)   : io_msgs/squashed_mocap_data [ros1msg]  
    (13) /mocap/ros_tf                                     1970 msgs (30.02 Hz)   : tf2_msgs/TFMessage [ros1msg]           
    (14) /left_ee_pose                                     1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
    (15) /right_ee_pose                                    1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
    (16) /usb_cam_left/mjpeg_raw/compressed                1960 msgs (29.87 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (17) /usb_cam_right/mjpeg_raw/compressed               1946 msgs (29.65 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (18) /usb_cam_fisheye/mjpeg_raw/compressed             1957 msgs (29.82 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (19) /realsense_left_hand/depth/image_rect_raw         1961 msgs (29.88 Hz)   : sensor_msgs/Image [ros1msg]            
    (20) /realsense_left_hand/color/image_raw/compressed   1961 msgs (29.88 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (21) /realsense_right_hand/depth/image_rect_raw        1947 msgs (29.67 Hz)   : sensor_msgs/Image [ros1msg]            
    (22) /realsense_right_hand/color/image_raw/compressed  1947 msgs (29.67 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
    (23) /world_left_ee_pose                               1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
    (24) /world_right_ee_pose                              1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
channels: 24
attachments: 0
metadata: 0
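
The per-channel rates in the summary above are consistent with dividing each channel's message count by the file duration. The sketch below reproduces three of them; `message_rate_hz` is a hypothetical helper, and the counts and duration are copied from the summary.

```python
def message_rate_hz(message_count: int, duration_s: float) -> float:
    """Average message rate over the whole recording."""
    return message_count / duration_s


# "duration: 1m5.625866496s" from the summary above.
duration_s = 65.625866496

counts = {
    "/joint_states": 1970,
    "/usb_cam_left/mjpeg_raw/compressed": 1960,
    "/usb_cam_right/mjpeg_raw/compressed": 1946,
}

for topic, count in counts.items():
    print(f"{topic}: {message_rate_hz(count, duration_s):.2f} Hz")
# /joint_states: 30.02 Hz
# /usb_cam_left/mjpeg_raw/compressed: 29.87 Hz
# /usb_cam_right/mjpeg_raw/compressed: 29.65 Hz
```

The slightly lower counts on the USB and RealSense camera channels mean those streams do not have exactly one message per frame of the main stream, which is worth keeping in mind when pairing samples across topics.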

Natural Language Annotation

The annotation.json file contains semantic annotation information for tasks, used for training and understanding task intent.

Main Field Descriptions:

Field	Type	Description
`belong_to`	string	Associated data file identifier
`object_set`	array	All objects involved in the task
`scene`	string	Scene identifier
`skill_set`	array	Skill template collection
`subtasks`	array	Subtask sequence
`task_description`	string	Task description

Skill Template Format:

pick {A} from {B} - Pick A from B
place {A} on {B} - Place A on B
toss {A} into {B} - Toss A into B
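
The `{A}` and `{B}` placeholders happen to match Python's `str.format` syntax, so a template can be filled directly. This is only an illustration with a hypothetical `fill_skill` helper; the `description` fields in the annotation examples are free text and include extra detail, such as which gripper was used.

```python
def fill_skill(template: str, a: str, b: str) -> str:
    """Substitute objects into a skill template such as 'pick {A} from {B}'."""
    return template.format(A=a, B=b)


print(fill_skill("pick {A} from {B}", "paper cup", "placemat"))
# pick paper cup from placemat
```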

Subtask Structure:

{
  "skill": "pick {A} from {B}",
  "description": "pick the paper cup from the placemat with the left gripper",
  "description_zh": "左夹爪 从 餐垫 捡起 纸杯",
  "start_frame_id": 159,
  "end_frame_id": 227,
  "start_timestamp": "1736935774906000000",
  "end_timestamp": "1736935777206000000",
  "sequence_id": 1,
  "attempts": "success",
  "comment": ""
}
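
The timestamps are stored as strings. Reading them as Unix epoch nanoseconds is an assumption, but one that is consistent with the start time reported in the MCAP summary. Under that reading, a subtask's duration and frame span can be computed as in the sketch below; converting with `int` rather than `float` keeps full nanosecond precision.

```python
import json

# The timing fields of the subtask shown above.
subtask = json.loads("""
{
  "start_frame_id": 159,
  "end_frame_id": 227,
  "start_timestamp": "1736935774906000000",
  "end_timestamp": "1736935777206000000"
}
""")

NS_PER_S = 1_000_000_000

# Assumption: timestamps are Unix epoch nanoseconds.
start_ns = int(subtask["start_timestamp"])
end_ns = int(subtask["end_timestamp"])

duration_s = (end_ns - start_ns) / NS_PER_S
frame_count = subtask["end_frame_id"] - subtask["start_frame_id"]

print(f"{duration_s:.1f} s over {frame_count} frames")
# 2.3 s over 68 frames
```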

Complete annotation data example:

{
  "belong_to": "20250115_InnerTest_PublicArea_TableClearing_szk_180926",
  "mocap_offset": [],
  "object_set": [
    "paper cup",
    "placemat",
    "trash can",
    "napkin",
    "plate",
    "dinner knife",
    "tableware storage box",
    "wine glass",
    "dinner fork"
  ],
  "scene": "PublicArea",
  "skill_set": [
    "pick {A} from {B}",
    "toss {A} into {B}",
    "place {A} on {B}"
  ],
  "subtasks": [
    {
      "skill": "pick {A} from {B}",
      "description": "pick the paper cup from the placemat with the left gripper",
      "description_zh": "左夹爪 从 餐垫 捡起 纸杯",
      "end_frame_id": 227,
      "end_timestamp": "1736935777206000000",
      "sequence_id": 1,
      "start_frame_id": 159,
      "start_timestamp": "1736935774906000000",
      "comment": "",
      "attempts": "success"
    },
    {
      "skill": "toss {A} into {B}",
      "description": "toss the paper cup into the trash can with the left gripper",
      "description_zh": "左夹爪 扔纸杯进垃圾桶",
      "end_frame_id": 318,
      "end_timestamp": "1736935780244000000",
      "sequence_id": 2,
      "start_frame_id": 231,
      "start_timestamp": "1736935777306000000",
      "comment": "",
      "attempts": "success"
    }
  ],
  "tag_set": [],
  "task_description": "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
}
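
Before using annotations for training, it can help to check that they are internally consistent. The checks below are illustrative and not prescribed by the platform: `check_annotation` is a hypothetical helper, and it assumes that `sequence_id` values increase through the subtask list.

```python
def check_annotation(annotation: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    skills = set(annotation.get("skill_set", []))
    previous_id = 0
    for subtask in annotation.get("subtasks", []):
        seq = subtask["sequence_id"]
        if subtask["skill"] not in skills:
            problems.append(f"subtask {seq}: skill not in skill_set")
        if int(subtask["start_timestamp"]) >= int(subtask["end_timestamp"]):
            problems.append(f"subtask {seq}: start is not before end")
        # Assumption: sequence_id values are positive and strictly increasing.
        if seq <= previous_id:
            problems.append(f"subtask {seq}: sequence_id is not increasing")
        previous_id = seq
    return problems


# Abbreviated version of the example above.
annotation = {
    "skill_set": ["pick {A} from {B}", "toss {A} into {B}", "place {A} on {B}"],
    "subtasks": [
        {"skill": "pick {A} from {B}", "sequence_id": 1,
         "start_timestamp": "1736935774906000000",
         "end_timestamp": "1736935777206000000"},
        {"skill": "toss {A} into {B}", "sequence_id": 2,
         "start_timestamp": "1736935777306000000",
         "end_timestamp": "1736935780244000000"},
    ],
}

print(check_annotation(annotation))
# []
```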

Teleoperation Robot Data Format

Teleoperation robot data records the process of operators controlling robots through VR devices.

File Structure

f"{robot_name}_{date}_{timestamp}_{sequence_id}"
├── RM_AIDAL_250124_172033_0.mcap    # Multimodal data
├── RM_AIDAL_250124_172033_0.json    # Annotation data
└── RM_AIDAL_250124_172033_0.metadata.yaml  # Metadata
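
Because the robot name can itself contain underscores (as in `RM_AIDAL`), a name following this pattern is easier to split from the right. A minimal sketch, with `parse_teleop_name` as a hypothetical helper:

```python
def parse_teleop_name(filename: str) -> dict:
    """Split '{robot_name}_{date}_{timestamp}_{sequence_id}' into its fields."""
    # Drop everything from the first dot (".mcap", ".json", ".metadata.yaml").
    stem = filename.split(".", 1)[0]
    # Split from the right so underscores inside the robot name are preserved.
    robot_name, date, timestamp, sequence_id = stem.rsplit("_", 3)
    return {
        "robot_name": robot_name,
        "date": date,
        "timestamp": timestamp,
        "sequence_id": int(sequence_id),
    }


print(parse_teleop_name("RM_AIDAL_250124_172033_0.mcap"))
# {'robot_name': 'RM_AIDAL', 'date': '250124', 'timestamp': '172033', 'sequence_id': 0}
```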

Multimodal Data

Main Topic List:

Topic Name	Data Type	Description
`/camera_01/color/image_raw/compressed`	`sensor_msgs/msg/CompressedImage`	RGB image from main camera
`/camera_02/color/image_raw/compressed`	`sensor_msgs/msg/CompressedImage`	RGB image from left camera
`/camera_03/color/image_raw/compressed`	`sensor_msgs/msg/CompressedImage`	RGB image from right camera
`io_teleop/joint_states`	`sensor_msgs/msg/JointState`	Joint states
`io_teleop/joint_cmd`	`sensor_msgs/msg/JointState`	Joint commands
`io_teleop/target_ee_poses`	`geometry_msgs/msg/PoseArray`	Target end-effector poses
`io_teleop/target_base_move`	`std_msgs/msg/Float64MultiArray`	Target base movement
`io_teleop/target_gripper_status`	`sensor_msgs/msg/JointState`	Target gripper status
`io_teleop/target_joint_from_vr`	`sensor_msgs/msg/JointState`	Joint targets from VR device
`/robot_description`	`std_msgs/msg/String`	Robot URDF description
`/tf`	`tf2_msgs/msg/TFMessage`	TF spatial pose transformation information

Example MCAP file summary:

Files:             RM_AIDAL_250126_091041_0.mcap
Bag size:          443.3 MiB
Storage id:        mcap
Duration:          100.052164792s
Start:             Jan 24 2025 21:37:32.526605552 (1737725852.526605552)
End:               Jan 24 2025 21:39:12.578770344 (1737725952.578770344)
Messages:          62116
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
                   Topic: /camera_02/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
                   Topic: /camera_03/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
                   Topic: io_teleop/joint_states | Type: sensor_msgs/msg/JointState | Count: 1529 | Serialization Format: cdr
                   Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
                   Topic: io_teleop/target_ee_poses | Type: geometry_msgs/msg/PoseArray | Count: 10014 | Serialization Format: cdr
                   Topic: io_teleop/target_base_move | Type: std_msgs/msg/Float64MultiArray | Count: 10010 | Serialization Format: cdr
                   Topic: io_teleop/target_gripper_status | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
                   Topic: io_teleop/target_joint_from_vr | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
                   Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
                   Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr
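
Dividing the counts above by the 100-second duration gives roughly 30 Hz for the cameras, 100 Hz for the command topics, and 15 Hz for `io_teleop/joint_states` and `/tf`. Streams recorded at different rates usually have to be paired by timestamp before training. The sketch below shows one common approach, nearest-timestamp matching; it is not a description of how the platform aligns data, and the timestamps are made up for illustration.

```python
from bisect import bisect_left


def nearest_index(sorted_ts: list[int], target: int) -> int:
    """Index of the element in sorted_ts that is closest to target."""
    if not sorted_ts:
        raise ValueError("sorted_ts must not be empty")
    pos = bisect_left(sorted_ts, target)
    if pos == 0:
        return 0
    if pos == len(sorted_ts):
        return len(sorted_ts) - 1
    before, after = sorted_ts[pos - 1], sorted_ts[pos]
    # On a tie, prefer the earlier sample.
    return pos if after - target < target - before else pos - 1


# Hypothetical timestamps in milliseconds: ~30 Hz camera, 100 Hz commands.
camera_ts = [0, 33, 67, 100]
command_ts = list(range(0, 101, 10))

matched = [command_ts[nearest_index(command_ts, t)] for t in camera_ts]
print(matched)
# [0, 30, 70, 100]
```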

Natural Language Annotation

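Teleoperation annotations follow the same structure as human data annotations: natural language descriptions of the actions the robot or human performed and the objects involved. In the example below, each subtask also carries `objecta`, `objectb`, and `options` fields and has no frame ID fields.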

Complete teleoperation annotation data example:

{
    "belong_to": "RM_AIDAL_250126_091041_0",
    "mocap_offset": [],
    "object_set": [
        "lemon candy",
        "plate",
        "pistachios"
    ],
    "scene": "250126",
    "skill_set": [
        "place {A} on {B}"
    ],
    "subtasks": [
        {
            "skill": "place {A} on {B}",
            "objecta": "lemon candy",
            "objectb": "plate",
            "options": [
                "leftHand"
            ],
            "description": "place the lemon candy on the plate with the left hand",
            "end_timestamp": "1737725886915000000",
            "sequence_id": 1,
            "start_timestamp": "1737725880757000000",
            "comment": "",
            "attempts": "success"
        },
        {
            "skill": "place {A} on {B}",
            "objecta": "pistachios",
            "objectb": "plate",
            "options": [
                "rightHand"
            ],
            "description": "place the pistachios on the plate with the right hand",
            "end_timestamp": "1737725950745000000",
            "sequence_id": 2,
            "start_timestamp": "1737725941657000000",
            "comment": "",
            "attempts": "success"
        }
    ],
    "tag_set": [],
    "task_description": "20250205_RM_ItemPacking_zhouxw"
}

Export Model Training Data

To facilitate model training, the platform provides multiple data export capabilities, converting original MCAP and JSON data into formats suitable for machine learning training.

The common HDF5 and LeRobot formats can be exported with one click. The export adapts automatically to different robots and to different numbers of sensors, with no manual configuration.

HDF5 Format

HDF5 format is suitable for large-scale data storage and fast access, using a hierarchical structure to organize data.

File Structure:

chunk_001.hdf5
├── /data/                    # Data group
│   ├── episode_001/         # First task sequence
│   │   ├── action           # Joint commands (multi-dimensional array)
│   │   ├── observation.state # Sensor observations
│   │   ├── observation.gripper # Gripper status
│   │   └── observation.images.* # Multi-view images
│   └── episode_002/         # Second task sequence
└── /meta/                   # Metadata group

Data Content:

action - Joint control commands (float32 array)
observation.state - Sensor observations (float32 array)
observation.images.* - Compressed image data (JPEG format)
observation.gripper - Gripper status (float32 array)
task - English natural language description
task_zh - Chinese natural language description
score - Action quality score
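
The hierarchy above implies one dataset path per signal and episode. The sketch below builds those paths for a given episode; `episode_dataset_paths` is a hypothetical helper, the camera names are placeholders, and the three-digit zero padding is inferred from `episode_001` and `episode_002` in the tree.

```python
def episode_dataset_paths(episode_number: int, camera_names: list[str]) -> list[str]:
    """List the dataset paths for one episode, following the tree above."""
    # Assumption: episode numbers are zero-padded to three digits.
    group = f"/data/episode_{episode_number:03d}"
    names = ["action", "observation.state", "observation.gripper"]
    names += [f"observation.images.{camera}" for camera in camera_names]
    return [f"{group}/{name}" for name in names]


# Placeholder camera names, used only for illustration.
for path in episode_dataset_paths(1, ["cam_high", "cam_left_wrist"]):
    print(path)
# /data/episode_001/action
# /data/episode_001/observation.state
# /data/episode_001/observation.gripper
# /data/episode_001/observation.images.cam_high
# /data/episode_001/observation.images.cam_left_wrist
```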

LeRobot Format

LeRobot format is a widely used data format in the robot learning field and is compatible with mainstream robot learning frameworks.

Reference Sample Data: https://huggingface.co/datasets/io-intelligence/piper_uncap_pen

Data Feature Definitions:

The number of features and their shapes in an exported LeRobot dataset are adapted automatically, so any number of cameras or joints is supported. The shapes shown here are for the export format of the Songling desktop 7-DOF robotic arm:

Feature Name	Data Type	Shape	Description
`action`	float32	[14]	Joint commands (7 joints each for left and right arms)
`observation.state`	float32	[14]	Joint states (7 joints each for left and right arms)
`observation.images.cam_high`	image	[3,480,640]	High camera image
`observation.images.cam_low`	image	[3,480,640]	Low camera image
`observation.images.cam_left_wrist`	image	[3,480,640]	Left wrist camera image
`observation.images.cam_right_wrist`	image	[3,480,640]	Right wrist camera image
`timestamp`	float32	[1]	Timestamp
`frame_index`	int64	[1]	Frame index
`episode_index`	int64	[1]	Task sequence index
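
A 14-element `action` or `observation.state` vector is easier to work with once each value is labeled. The sketch below uses the name order from the format definition example in this section (right arm first, then left arm); the `label_vector` helper and the sample values are hypothetical.

```python
# Name order copied from the "names" list in the format definition example.
JOINT_NAMES = [
    "right_waist", "right_shoulder", "right_elbow", "right_forearm_roll",
    "right_wrist_angle", "right_wrist_rotate", "right_gripper",
    "left_waist", "left_shoulder", "left_elbow", "left_forearm_roll",
    "left_wrist_angle", "left_wrist_rotate", "left_gripper",
]


def label_vector(values: list[float]) -> dict[str, float]:
    """Pair each value with its joint name."""
    if len(values) != len(JOINT_NAMES):
        raise ValueError(f"expected {len(JOINT_NAMES)} values, got {len(values)}")
    return dict(zip(JOINT_NAMES, values))


state = [0.1 * i for i in range(14)]  # made-up values for illustration
labeled = label_vector(state)

right_arm = {k: v for k, v in labeled.items() if k.startswith("right_")}
left_arm = {k: v for k, v in labeled.items() if k.startswith("left_")}
print(len(right_arm), len(left_arm))
# 7 7
```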

Complete LeRobot format definition example:

{
    "codebase_version": "v2.1",
    "robot_type": "aloha",
    "total_episodes": 10,
    "total_frames": 3000,
    "total_tasks": 1,
    "total_videos": 0,
    "total_chunks": 1,
    "chunks_size": 1000,
    "fps": 15,
    "splits": {
        "train": "0:10"
    },
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
    "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
    "features": {
        "observation.state": {
            "dtype": "float32",
            "shape": [14],
            "names": [
                [
                    "right_waist",
                    "right_shoulder",
                    "right_elbow",
                    "right_forearm_roll",
                    "right_wrist_angle",
                    "right_wrist_rotate",
                    "right_gripper",
                    "left_waist",
                    "left_shoulder",
                    "left_elbow",
                    "left_forearm_roll",
                    "left_wrist_angle",
                    "left_wrist_rotate",
                    "left_gripper"
                ]
            ]
        },
        "action": {
            "dtype": "float32",
            "shape": [14],
            "names": [
                [
                    "right_waist",
                    "right_shoulder",
                    "right_elbow",
                    "right_forearm_roll",
                    "right_wrist_angle",
                    "right_wrist_rotate",
                    "right_gripper",
                    "left_waist",
                    "left_shoulder",
                    "left_elbow",
                    "left_forearm_roll",
                    "left_wrist_angle",
                    "left_wrist_rotate",
                    "left_gripper"
                ]
            ]
        },
        "observation.images.cam_high": {
            "dtype": "image",
            "shape": [3, 480, 640],
            "names": ["channels", "height", "width"]
        },
        "observation.images.cam_low": {
            "dtype": "image",
            "shape": [3, 480, 640],
            "names": ["channels", "height", "width"]
        },
        "observation.images.cam_left_wrist": {
            "dtype": "image",
            "shape": [3, 480, 640],
            "names": ["channels", "height", "width"]
        },
        "observation.images.cam_right_wrist": {
            "dtype": "image",
            "shape": [3, 480, 640],
            "names": ["channels", "height", "width"]
        },
        "timestamp": {
            "dtype": "float32",
            "shape": [1],
            "names": null
        },
        "frame_index": {
            "dtype": "int64",
            "shape": [1],
            "names": null
        },
        "episode_index": {
            "dtype": "int64",
            "shape": [1],
            "names": null
        },
        "index": {
            "dtype": "int64",
            "shape": [1],
            "names": null
        },
        "task_index": {
            "dtype": "int64",
            "shape": [1],
            "names": null
        }
    }
}
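
The `data_path` template and `chunks_size` in the definition above together determine where each episode file lives. The sketch below resolves a path, assuming that `episode_chunk` is `episode_index // chunks_size` (the convention used by LeRobot v2.x), and derives the total recording length from `total_frames` and `fps`. The `episode_data_path` helper is hypothetical.

```python
# A subset of the fields from the definition above.
info = {
    "total_episodes": 10,
    "total_frames": 3000,
    "chunks_size": 1000,
    "fps": 15,
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
}


def episode_data_path(info: dict, episode_index: int) -> str:
    """Resolve the parquet path for one episode."""
    # Assumption: episodes are grouped into chunks of `chunks_size` episodes.
    episode_chunk = episode_index // info["chunks_size"]
    return info["data_path"].format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )


print(episode_data_path(info, 3))
# data/chunk-000/episode_000003.parquet

total_seconds = info["total_frames"] / info["fps"]
print(total_seconds)
# 200.0
```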
