Data Formats

The IO Data Platform supports flexible data formats and custom data visualization templates.

Below is an example using data collected by the IO data acquisition product:

Human Data Format

File Structure

f"{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}"
├── align_result.csv # Timestamp alignment table
├── annotation.json  # Annotation data
├── config           # Camera and sensor configuration
│   ├── calib_data.yml
│   ├── depth_to_rgb.yml
│   ├── mocap_main.yml
│   ├── orbbec_depth.yml
│   ├── orbbec_rgb.yml
│   └── pose_calib.yml
└── data.mcap        # Multimodal data
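
The directory name packs six fields into one underscore-separated string. A minimal sketch of splitting it back apart, assuming none of the fields contains an underscore itself (true for the sample name in the annotation example on this page, but not guaranteed by the naming pattern). `RecordingName` and `parse_recording_name` are hypothetical names used for illustration only:

```python
from typing import NamedTuple


class RecordingName(NamedTuple):
    date: str
    project: str
    scene: str
    task: str
    staff_id: str
    timestamp: str


def parse_recording_name(name: str) -> RecordingName:
    """Split a recording directory name into its six fields.

    Assumption: no field contains an underscore.
    """
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(
            f"expected 6 underscore-separated fields, got {len(parts)}: {name!r}"
        )
    return RecordingName(*parts)


info = parse_recording_name("20250115_InnerTest_PublicArea_TableClearing_szk_180926")
print(info.scene, info.task)
```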

Multimodal Data

library:   mcap go v1.7.0                                              
profile:   ros1                                                        
messages:  45200                                                       
duration:  1m5.625866496s                                              
start:     2025-01-15T18:09:29.628202496+08:00 (1736935769.628202496)  
end:       2025-01-15T18:10:35.254068992+08:00 (1736935835.254068992)  
compression:
  zstd: [764/764 chunks] [6.13 GiB/3.84 GiB (37.39%)] [59.87 MiB/sec] 
channels:
  (1)  /rgbd/color/image_raw/compressed                  1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (2)  /joint_states                                     1970 msgs (30.02 Hz)   : sensor_msgs/JointState [ros1msg]       
  (3)  /claws_r_hand                                     1970 msgs (30.02 Hz)   : io_msgs/claws_angle [ros1msg]          
  (4)  /global_localization                              1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
  (5)  /robot_description                                   1 msgs              : std_msgs/String [ros1msg]              
  (6)  /ee_visualization                                 1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (7)  /rgbd/depth/image_raw                             1970 msgs (30.02 Hz)   : sensor_msgs/Image [ros1msg]            
  (8)  /colorized_depth                                  1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (9)  /claws_l_hand                                     1970 msgs (30.02 Hz)   : io_msgs/claws_angle [ros1msg]          
  (10) /claws_touch_data                                 1970 msgs (30.02 Hz)   : io_msgs/squashed_touch [ros1msg]       
  (11) /touch_visualization                              1970 msgs (30.02 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (12) /mocap/sensor_data                                1970 msgs (30.02 Hz)   : io_msgs/squashed_mocap_data [ros1msg]  
  (13) /mocap/ros_tf                                     1970 msgs (30.02 Hz)   : tf2_msgs/TFMessage [ros1msg]           
  (14) /left_ee_pose                                     1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
  (15) /right_ee_pose                                    1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
  (16) /usb_cam_left/mjpeg_raw/compressed                1960 msgs (29.87 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (17) /usb_cam_right/mjpeg_raw/compressed               1946 msgs (29.65 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (18) /usb_cam_fisheye/mjpeg_raw/compressed             1957 msgs (29.82 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (19) /realsense_left_hand/depth/image_rect_raw         1961 msgs (29.88 Hz)   : sensor_msgs/Image [ros1msg]            
  (20) /realsense_left_hand/color/image_raw/compressed   1961 msgs (29.88 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (21) /realsense_right_hand/depth/image_rect_raw        1947 msgs (29.67 Hz)   : sensor_msgs/Image [ros1msg]            
  (22) /realsense_right_hand/color/image_raw/compressed  1947 msgs (29.67 Hz)   : sensor_msgs/CompressedImage [ros1msg]  
  (23) /world_left_ee_pose                               1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
  (24) /world_right_ee_pose                              1970 msgs (30.02 Hz)   : geometry_msgs/PoseStamped [ros1msg]    
channels: 24
attachments: 0
metadata: 0
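
The per-channel rates in this summary appear to be the message count divided by the total recording duration. A quick check of that reading, using numbers copied from the output above (the formula is inferred from these numbers, not taken from the mcap documentation):

```python
# Start and end times in nanoseconds, copied from the summary above.
START_NS = 1736935769628202496
END_NS = 1736935835254068992

duration_s = (END_NS - START_NS) / 1e9  # matches the reported 1m5.625866496s


def rate_hz(message_count: int) -> float:
    """Average message rate over the whole recording, rounded to two decimals."""
    return round(message_count / duration_s, 2)


for topic, count in [
    ("/joint_states", 1970),
    ("/usb_cam_left/mjpeg_raw/compressed", 1960),
    ("/usb_cam_right/mjpeg_raw/compressed", 1946),
]:
    print(f"{topic}: {rate_hz(count)} Hz")
```

The slightly lower counts on the USB and RealSense channels therefore indicate a small number of missing frames relative to the 1970-message channels over the same 65.6 s window.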

Topic Name	Description
/mocap/sensor_data	Joint velocity, acceleration, angular velocity, rotation angle, and sensor data from motion capture
/mocap/ros_tf	TF of all joints from motion capture
/joint_states	JointState of all joints from motion capture
/right_ee_pose	Right gripper pose in the main head camera coordinate system
/left_ee_pose	Left gripper pose in the main head camera coordinate system
/claws_l_hand	Left gripper closure degree
/claws_r_hand	Right gripper closure degree
/claws_touch_data	Gripper tactile data. Contains two messages; each message's frame_id indicates the left or right gripper, and only the first four values of data are valid
/realsense_left_hand/color/image_raw/compressed	RGB image from left gripper camera
/realsense_left_hand/depth/image_rect_raw	Depth image from left gripper camera
/realsense_right_hand/color/image_raw/compressed	RGB image from right gripper camera
/realsense_right_hand/depth/image_rect_raw	Depth image from right gripper camera
/rgbd/color/image_raw/compressed	RGB image from main head camera
/rgbd/depth/image_raw	Depth image from main head camera
/colorized_depth	Colorized depth image from main head camera
/usb_cam_fisheye/mjpeg_raw/compressed	RGB image from main head fisheye camera
/usb_cam_left/mjpeg_raw/compressed	RGB image from main head left monocular camera
/usb_cam_right/mjpeg_raw/compressed	RGB image from main head right monocular camera
/ee_visualization	End-effector pose visualization in main head camera RGB image
/touch_visualization	Gripper tactile data visualization
/robot_description	Motion capture URDF
/global_localization	Main head camera pose in world coordinate system
/world_left_ee_pose	Left gripper pose in world coordinate system
/world_right_ee_pose	Right gripper pose in world coordinate system

If the operator wears tactile gloves during collection, the recording includes an additional topic carrying the tactile digital signal array:

/mocap/touch_data 57 msgs (30.25 Hz): io_msgs/squashed_touch [ros1msg]

Natural Language Annotation Data

{
  "belong_to": "20250115_InnerTest_PublicArea_TableClearing_szk_180926",
  "mocap_offset": [],
  "object_set": [
  "paper cup",
  "placemat",
  "trash can",
  "napkin",
  "plate",
  "dinner knife",
  "tableware storage box",
  "wine glass",
  "dinner fork"
  ],
  "scene": "PublicArea",
  "skill_set": [
  "pick {A} from {B}",
  "toss {A} into {B}",
  "place {A} on {B}"
  ],
  "subtasks": [
  {
    "skill": "pick {A} from {B}",
    "description": "pick the paper cup from the placemat with the left gripper",
    "description_zh": "左夹爪 从 餐垫 捡起 纸杯",
    "end_frame_id": 227,
    "end_timestamp": "1736935777206000000",
    "sequence_id": 1,
    "start_frame_id": 159,
    "start_timestamp": "1736935774906000000",
    "comment": "",
    "attempts": "success"
  },
  {
    "skill": "toss {A} into {B}",
    "description": "toss the paper cup into the trash can with the left gripper",
    "description_zh": "左夹爪 扔纸杯进垃圾桶",
    "end_frame_id": 318,
    "end_timestamp": "1736935780244000000",
    "sequence_id": 2,
    "start_frame_id": 231,
    "start_timestamp": "1736935777306000000",
    "comment": "",
    "attempts": "success"
  },
  ...
  ],
  "tag_set": [],
  "task_description": "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
}
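
Subtask boundaries are stored as nanosecond Unix timestamps encoded as strings. A minimal sketch of turning them into durations, using the two subtasks shown above; parsing with `int` rather than `float` keeps the full nanosecond precision. The helper name `subtask_durations` is hypothetical:

```python
import json

# Trimmed copy of the annotation above; only the fields used here are kept.
annotation = json.loads("""
{
  "subtasks": [
    {"sequence_id": 1, "start_frame_id": 159, "end_frame_id": 227,
     "start_timestamp": "1736935774906000000", "end_timestamp": "1736935777206000000"},
    {"sequence_id": 2, "start_frame_id": 231, "end_frame_id": 318,
     "start_timestamp": "1736935777306000000", "end_timestamp": "1736935780244000000"}
  ]
}
""")


def subtask_durations(doc: dict) -> dict:
    """Map each sequence_id to its subtask duration in seconds."""
    durations = {}
    for sub in doc["subtasks"]:
        start_ns = int(sub["start_timestamp"])
        end_ns = int(sub["end_timestamp"])
        durations[sub["sequence_id"]] = (end_ns - start_ns) / 1e9
    return durations


print(subtask_durations(annotation))
```

The first subtask spans 68 frames (159 to 227), which at roughly 30 Hz is consistent with the duration computed from its timestamps.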

Teleoperation Robot Data Format

File Structure

f"{robot_name}_{date}_{timestamp}_{sequence_id}"
├── RM_AIDAL_250124_172033_0.mcap
├── RM_AIDAL_250124_172033_0.json
└── RM_AIDAL_250126_093648_0.metadata.yaml
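
In this pattern the robot name can itself contain underscores (`RM_AIDAL` in the sample files), so splitting from the left is ambiguous. A minimal sketch that splits from the right instead, assuming the last three fields never contain underscores; `parse_episode_name` is a hypothetical helper:

```python
def parse_episode_name(stem: str) -> dict:
    """Split '{robot_name}_{date}_{timestamp}_{sequence_id}' from the right."""
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected episode name: {stem!r}")
    robot_name, date, timestamp, sequence_id = parts
    return {
        "robot_name": robot_name,
        "date": date,            # appears to be YYMMDD in the sample names
        "timestamp": timestamp,  # appears to be HHMMSS in the sample names
        "sequence_id": int(sequence_id),
    }


episode = parse_episode_name("RM_AIDAL_250124_172033_0")
print(episode)
```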

Multimodal Data

Files:             RM_AIDAL_250126_091041_0.mcap
Bag size:          443.3 MiB
Storage id:        mcap
Duration:          100.052164792s
Start:             Jan 24 2025 21:37:32.526605552 (1737725852.526605552)
End:               Jan 24 2025 21:39:12.578770344 (1737725952.578770344)
Messages:          62116
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
           Topic: /camera_02/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
           Topic: /camera_03/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
           Topic: io_teleop/joint_states | Type: sensor_msgs/msg/JointState | Count: 1529 | Serialization Format: cdr
           Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
           Topic: io_teleop/target_ee_poses | Type: geometry_msgs/msg/PoseArray | Count: 10014 | Serialization Format: cdr
           Topic: io_teleop/target_base_move | Type: std_msgs/msg/Float64MultiArray | Count: 10010 | Serialization Format: cdr
           Topic: io_teleop/target_gripper_status | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
           Topic: io_teleop/target_joint_from_vr | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
           Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
           Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr

Topic Name	Description
/camera_01/color/image_raw/compressed	RGB image from main camera
/camera_02/color/image_raw/compressed	RGB image from left camera
/camera_03/color/image_raw/compressed	RGB image from right camera
io_teleop/joint_states	Joint states
io_teleop/joint_cmd	Joint commands
io_teleop/target_ee_poses	Target end-effector poses
io_teleop/target_base_move	Target base movement
io_teleop/target_gripper_status	Target gripper status
io_teleop/target_joint_from_vr	Joint targets from VR device
/robot_description	Robot URDF description
/tf	TF spatial pose transformation information
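
The streams in this recording arrive at different rates: over roughly 100 s, each camera delivers 3000 frames (about 30 Hz), `io_teleop/joint_states` delivers 1529 messages (about 15 Hz), and the command topics deliver about 10000 messages each (about 100 Hz). Training pipelines usually need one sample per camera frame, so the other streams have to be matched to frame times. The source does not say how the platform does this; the sketch below shows nearest-timestamp matching, one common approach, on synthetic timestamps. `nearest_indices` is a hypothetical helper:

```python
from bisect import bisect_left


def nearest_indices(reference: list, candidates: list) -> list:
    """For each reference timestamp, return the index of the closest candidate.

    Both lists must be sorted ascending; candidates must be non-empty.
    """
    result = []
    for t in reference:
        pos = bisect_left(candidates, t)
        if pos == 0:
            result.append(0)
        elif pos == len(candidates):
            result.append(len(candidates) - 1)
        else:
            before, after = candidates[pos - 1], candidates[pos]
            # Ties go to the earlier sample.
            result.append(pos - 1 if t - before <= after - t else pos)
    return result


# Synthetic timestamps in nanoseconds: a 30 Hz camera and a 100 Hz command stream.
camera_ns = [i * 1_000_000_000 // 30 for i in range(4)]
command_ns = [i * 10_000_000 for i in range(12)]

print(nearest_indices(camera_ns, command_ns))  # [0, 3, 7, 10]
```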

Natural Language Annotation Data

{
  "belong_to": "RM_AIDAL_250126_091041_0",
  "mocap_offset": [],
  "object_set": [
    "lemon candy",
    "plate",
    "pistachios"
  ],
  "scene": "250126",
  "skill_set": [
    "place {A} on {B}"
  ],
  "subtasks": [
    {
      "skill": "place {A} on {B}",
      "objecta": "lemon candy",
      "objectb": "plate",
      "options": [
        "leftHand"
      ],
      "description": "place the lemon candy on the plate with the left hand",
      "end_timestamp": "1737725886915000000",
      "sequence_id": 1,
      "start_timestamp": "1737725880757000000",
      "comment": "",
      "attempts": "success"
    },
    {
      "skill": "place {A} on {B}",
      "objecta": "pistachios",
      "objectb": "plate",
      "options": [
        "rightHand"
      ],
      "description": "place the pistachios on the plate with the right hand",
      "end_timestamp": "1737725950745000000",
      "sequence_id": 2,
      "start_timestamp": "1737725941657000000",
      "comment": "",
      "attempts": "success"
    }
  ],
  "tag_set": [],
  "task_description": "20250205_RM_ItemPacking_zhouxw"
}
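
Each subtask pairs a skill template from `skill_set` with concrete objects in `objecta` and `objectb`. As a sketch of how the fields relate (not the platform's actual text generation, which is not described here), the placeholders can be filled with `str.format`; `fill_skill` is a hypothetical helper:

```python
def fill_skill(subtask: dict) -> str:
    """Substitute objecta/objectb into the {A}/{B} placeholders of the skill template."""
    return subtask["skill"].format(A=subtask["objecta"], B=subtask["objectb"])


subtask = {
    "skill": "place {A} on {B}",
    "objecta": "lemon candy",
    "objectb": "plate",
    "options": ["leftHand"],
}

print(fill_skill(subtask))  # place lemon candy on plate
```

The `description` field in the sample goes further than this substitution: it adds articles and names the hand listed in `options`.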

Model Training Data

We provide tools to convert the MCAP and JSON data described above into formats that are easy to parse in Python and can be used directly for large-model training.

HDF5 Format

Below is a basic data example. The actual training data format may vary depending on the original data and customer customization requirements:

/root
  ├── metadata (Group)
  │     ├── creation_time (Attribute)
  │     ├── source (Attribute)
  │     ├── schema (Dataset)
  │
  ├── messages (Group)
  │     ├── /camera_01/color/image_raw/compressed (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── /camera_02/color/image_raw/compressed (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── /camera_03/color/image_raw/compressed (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/joint_states (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/joint_cmd (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/target_ee_poses (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/target_base_move (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/target_gripper_status (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── io_teleop/target_joint_from_vr (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── /robot_description (Group)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
  │     │
  │     ├── /tf (Group)
  │     │     ├── timestamps (Dataset)
  │     │     ├── data (Dataset)
  │     │     ├── schema_id (Attribute)
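
HDF5 uses `/` as its path separator, so a topic such as `/camera_01/color/image_raw/compressed` stored under `messages` is addressed as a nested path rather than as a single group name. A small sketch of building dataset paths for the layout above; `dataset_path` is a hypothetical helper, and whether the converter nests groups exactly this way is an assumption:

```python
import posixpath


def dataset_path(topic: str, dataset: str = "data") -> str:
    """Build the HDF5 path of a dataset for a given topic under /messages."""
    # Topics appear both with and without a leading slash in the recording.
    return posixpath.join("/messages", topic.lstrip("/"), dataset)


print(dataset_path("/camera_01/color/image_raw/compressed", "timestamps"))
print(dataset_path("io_teleop/joint_states"))
```

With a reader such as h5py, a path like this can be used to index an open file object directly.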

LeRobot Format

You can refer to our sample dataset: https://huggingface.co/datasets/io-ai-data/DesktopCleanup_RM_AIDAL_demo

{
  "codebase_version": "v2.1",
  "robot_type": "custom_arm",
  "total_episodes": 20,
  "total_frames": 5134,
  "total_tasks": 20,
  "total_videos": 0,
  "total_chunks": 1,
  "chunks_size": 1000,
  "fps": 30,
  "splits": {
    "train": "0:20"
  },
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.images.camera_01": {
      "dtype": "image",
      "shape": [
        480,
        640,
        3
      ]
    },
    "observation.images.camera_02": {
      "dtype": "image",
      "shape": [
        480,
        640,
        3
      ]
    },
    "observation.images.camera_03": {
      "dtype": "image",
      "shape": [
        480,
        640,
        3
      ]
    },
    "observation.images.camera_04": {
      "dtype": "image",
      "shape": [
        480,
        640,
        3
      ]
    },
    "observation.state": {
      "dtype": "float64",
      "shape": [
        37
      ],
      "names": [
        "r_joint1",
        "r_joint2",
        "r_joint3",
        "r_joint4",
        "r_joint5",
        "r_joint6",
        "l_joint1",
        "l_joint2",
        "l_joint3",
        "l_joint4",
        "l_joint5",
        "l_joint6",
        "R_thumb_MCP_joint1",
        "R_thumb_MCP_joint2",
        "R_thumb_PIP_joint",
        "R_thumb_DIP_joint",
        "R_index_MCP_joint",
        "R_index_DIP_joint",
        "R_middle_MCP_joint",
        "R_middle_DIP_joint",
        "R_ring_MCP_joint",
        "R_ring_DIP_joint",
        "R_pinky_MCP_joint",
        "R_pinky_DIP_joint",
        "L_thumb_MCP_joint1",
        "L_thumb_MCP_joint2",
        "L_thumb_PIP_joint",
        "L_thumb_DIP_joint",
        "L_index_MCP_joint",
        "L_index_DIP_joint",
        "L_middle_MCP_joint",
        "L_middle_DIP_joint",
        "L_ring_MCP_joint",
        "L_ring_DIP_joint",
        "L_pinky_MCP_joint",
        "L_pinky_DIP_joint",
        "platform_joint"
      ]
    },
    "action": {
      "dtype": "float64",
      "shape": [
        12
      ],
      "names": [
        "l_joint1",
        "l_joint2",
        "l_joint3",
        "l_joint4",
        "l_joint5",
        "l_joint6",
        "r_joint1",
        "r_joint2",
        "r_joint3",
        "r_joint4",
        "r_joint5",
        "r_joint6"
      ]
    },
    "observation.gripper": {
      "dtype": "float64",
      "shape": [
        2
      ],
      "names": [
        "right_gripper",
        "left_gripper"
      ]
    },
    "timestamp": {
      "dtype": "float32",
      "shape": [
        1
      ],
      "names": null
    },
    "frame_index": {
      "dtype": "int64",
      "shape": [
        1
      ],
      "names": null
    },
    "episode_index": {
      "dtype": "int64",
      "shape": [
        1
      ],
      "names": null
    },
    "index": {
      "dtype": "int64",
      "shape": [
        1
      ],
      "names": null
    },
    "task_index": {
      "dtype": "int64",
      "shape": [
        1
      ],
      "names": null
    }
  }
}
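
The `data_path` entry is a Python format string. A minimal sketch of resolving it for one episode, assuming the chunk index is the episode index integer-divided by `chunks_size` (to our understanding, the convention used by the LeRobot v2.1 layout); `episode_data_path` is a hypothetical helper:

```python
# Subset of the metadata above; only the fields used here are kept.
info = {
    "total_episodes": 20,
    "total_frames": 5134,
    "chunks_size": 1000,
    "fps": 30,
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
}


def episode_data_path(meta: dict, episode_index: int) -> str:
    """Resolve the parquet path of one episode from the metadata fields."""
    episode_chunk = episode_index // meta["chunks_size"]  # assumed chunking rule
    return meta["data_path"].format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )


print(episode_data_path(info, 7))  # data/chunk-000/episode_000007.parquet

# Average episode length implied by the totals above.
mean_seconds = info["total_frames"] / info["total_episodes"] / info["fps"]
print(f"{mean_seconds:.2f} s per episode on average")
```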
