
Data Formats

The IO Data Platform supports flexible data formats and allows for custom data visualization templates.

Below is an example using data collected by the IO data acquisition product:

Human Data Format

File Structure

f"{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}"
├── align_result.csv # Timestamp alignment table
├── annotation.json # Annotation data
├── config # Camera and sensor configuration
│   ├── calib_data.yml
│   ├── depth_to_rgb.yml
│   ├── mocap_main.yml
│   ├── orbbec_depth.yml
│   ├── orbbec_rgb.yml
│   └── pose_calib.yml
└── data.mcap # Multimodal data
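
The directory name packs six fields into one underscore-separated string. A minimal sketch of splitting such a name back into its parts follows. It assumes that no individual field contains an underscore, which the source does not state. `HumanRecordingName` and `parse_human_recording_name` are hypothetical names used only for illustration, not part of any IO tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HumanRecordingName:
    """Fields of a human-data recording directory name (hypothetical helper)."""
    date: str
    project: str
    scene: str
    task: str
    staff_id: str
    timestamp: str


def parse_human_recording_name(name: str) -> HumanRecordingName:
    """Split '{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}'.

    Assumption: fields never contain '_' themselves.
    """
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 underscore-separated fields, got {len(parts)}: {name!r}")
    return HumanRecordingName(*parts)


# Example name taken from the "belong_to" field of the annotation sample.
rec = parse_human_recording_name("20250115_InnerTest_PublicArea_TableClearing_szk_180926")
print(rec.scene, rec.task, rec.staff_id)
```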

Multimodal Data

library: mcap go v1.7.0
profile: ros1
messages: 45200
duration: 1m5.625866496s
start: 2025-01-15T18:09:29.628202496+08:00 (1736935769.628202496)
end: 2025-01-15T18:10:35.254068992+08:00 (1736935835.254068992)
compression:
zstd: [764/764 chunks] [6.13 GiB/3.84 GiB (37.39%)] [59.87 MiB/sec]
channels:
(1) /rgbd/color/image_raw/compressed 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(2) /joint_states 1970 msgs (30.02 Hz) : sensor_msgs/JointState [ros1msg]
(3) /claws_r_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(4) /global_localization 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(5) /robot_description 1 msgs : std_msgs/String [ros1msg]
(6) /ee_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(7) /rgbd/depth/image_raw 1970 msgs (30.02 Hz) : sensor_msgs/Image [ros1msg]
(8) /colorized_depth 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(9) /claws_l_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(10) /claws_touch_data 1970 msgs (30.02 Hz) : io_msgs/squashed_touch [ros1msg]
(11) /touch_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(12) /mocap/sensor_data 1970 msgs (30.02 Hz) : io_msgs/squashed_mocap_data [ros1msg]
(13) /mocap/ros_tf 1970 msgs (30.02 Hz) : tf2_msgs/TFMessage [ros1msg]
(14) /left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(15) /right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(16) /usb_cam_left/mjpeg_raw/compressed 1960 msgs (29.87 Hz) : sensor_msgs/CompressedImage [ros1msg]
(17) /usb_cam_right/mjpeg_raw/compressed 1946 msgs (29.65 Hz) : sensor_msgs/CompressedImage [ros1msg]
(18) /usb_cam_fisheye/mjpeg_raw/compressed 1957 msgs (29.82 Hz) : sensor_msgs/CompressedImage [ros1msg]
(19) /realsense_left_hand/depth/image_rect_raw 1961 msgs (29.88 Hz) : sensor_msgs/Image [ros1msg]
(20) /realsense_left_hand/color/image_raw/compressed 1961 msgs (29.88 Hz) : sensor_msgs/CompressedImage [ros1msg]
(21) /realsense_right_hand/depth/image_rect_raw 1947 msgs (29.67 Hz) : sensor_msgs/Image [ros1msg]
(22) /realsense_right_hand/color/image_raw/compressed 1947 msgs (29.67 Hz) : sensor_msgs/CompressedImage [ros1msg]
(23) /world_left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(24) /world_right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
channels: 24
attachments: 0
metadata: 0

| Topic Name | Description |
| --- | --- |
| /mocap/sensor_data | Joint velocity, acceleration, angular velocity, rotation angle, and sensor data from motion capture |
| /mocap/ros_tf | TF of all joints from motion capture |
| /joint_states | JointState of all joints from motion capture |
| /right_ee_pose | Right gripper pose in the main head camera coordinate system |
| /left_ee_pose | Left gripper pose in the main head camera coordinate system |
| /claws_l_hand | Left gripper closure degree |
| /claws_r_hand | Right gripper closure degree |
| /claws_touch_data | Gripper tactile data (contains two messages, each message's frame_id indicates left or right gripper, first four values of data are valid) |
| /realsense_left_hand/color/image_raw/compressed | RGB image from left gripper camera |
| /realsense_left_hand/depth/image_rect_raw | Depth image from left gripper camera |
| /realsense_right_hand/color/image_raw/compressed | RGB image from right gripper camera |
| /realsense_right_hand/depth/image_rect_raw | Depth image from right gripper camera |
| /rgbd/color/image_raw/compressed | RGB image from main head camera |
| /rgbd/depth/image_raw | Depth image from main head camera |
| /colorized_depth | Colorized depth image from main head camera |
| /usb_cam_fisheye/mjpeg_raw/compressed | RGB image from main head fisheye camera |
| /usb_cam_left/mjpeg_raw/compressed | RGB image from main head left monocular camera |
| /usb_cam_right/mjpeg_raw/compressed | RGB image from main head right monocular camera |
| /ee_visualization | End-effector pose visualization in main head camera RGB image |
| /touch_visualization | Gripper tactile data visualization |
| /robot_description | Motion capture URDF |
| /global_localization | Main head camera pose in world coordinate system |
| /world_left_ee_pose | Left gripper pose in world coordinate system |
| /world_right_ee_pose | Right gripper pose in world coordinate system |

When the operator wears tactile gloves during collection, the recording contains one additional topic that carries the tactile digital signal array:

/mocap/touch_data 57 msgs (30.25 Hz): io_msgs/squashed_touc [ros1msg]
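
The per-channel rates in the listing above appear to equal the message count divided by the recording duration. That reading is inferred from the numbers, not stated by the source. The sketch below reproduces the arithmetic and also shows one way to count messages on a topic with the `mcap` Python package (installed separately, for example with `pip install mcap`). `parse_go_duration`, `mean_rate_hz` and `count_messages` are hypothetical helper names.

```python
import re

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}
# Two-letter units come first so that "ms" is not read as "m".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|h|m|s)")


def parse_go_duration(text: str) -> float:
    """Convert a Go-style duration such as '1m5.625866496s' to seconds."""
    tokens = _TOKEN.findall(text)
    if not tokens or "".join(num + unit for num, unit in tokens) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNIT_SECONDS[unit] for num, unit in tokens)


def mean_rate_hz(message_count: int, duration_s: float) -> float:
    """Average message rate over the whole recording."""
    return message_count / duration_s


def count_messages(path: str, topic: str) -> int:
    """Count messages on one topic of an MCAP file."""
    from mcap.reader import make_reader  # imported lazily; needs the mcap package

    with open(path, "rb") as stream:
        reader = make_reader(stream)
        return sum(1 for _ in reader.iter_messages(topics=[topic]))


duration = parse_go_duration("1m5.625866496s")
print(round(mean_rate_hz(1970, duration), 2))  # /joint_states, listed as 30.02 Hz
print(round(mean_rate_hz(1946, duration), 2))  # /usb_cam_right, listed as 29.65 Hz
```

Decoding the ROS 1 payloads themselves needs a message decoder in addition to the reader; this sketch only touches the raw records.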

Natural Language Annotation Data

{
"belong_to": "20250115_InnerTest_PublicArea_TableClearing_szk_180926",
"mocap_offset": [],
"object_set": [
"paper cup",
"placemat",
"trash can",
"napkin",
"plate",
"dinner knife",
"tableware storage box",
"wine glass",
"dinner fork"
],
"scene": "PublicArea",
"skill_set": [
"pick {A} from {B}",
"toss {A} into {B}",
"place {A} on {B}"
],
"subtasks": [
{
"skill": "pick {A} from {B}",
"description": "pick the paper cup from the placemat with the left gripper",
"description_zh": "左夹爪 从 餐垫 捡起 纸杯",
"end_frame_id": 227,
"end_timestamp": "1736935777206000000",
"sequence_id": 1,
"start_frame_id": 159,
"start_timestamp": "1736935774906000000",
"comment": "",
"attempts": "success"
},
{
"skill": "toss {A} into {B}",
"description": "toss the paper cup into the trash can with the left gripper",
"description_zh": "左夹爪 扔纸杯进垃圾桶",
"end_frame_id": 318,
"end_timestamp": "1736935780244000000",
"sequence_id": 2,
"start_frame_id": 231,
"start_timestamp": "1736935777306000000",
"comment": "",
"attempts": "success"
},
...
],
"tag_set": [],
"task_description": "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
}
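
A minimal sketch of reading the subtask list, assuming that `start_timestamp` and `end_timestamp` are nanoseconds since the Unix epoch. That assumption is consistent with the start and end times of the MCAP summary, but the source does not state the unit. The JSON string is a trimmed copy of the sample above, and `subtask_spans` is a hypothetical helper name.

```python
import json

# Trimmed copy of the sample annotation: only the fields used below are kept.
ANNOTATION_JSON = """
{
  "subtasks": [
    {"sequence_id": 1,
     "description": "pick the paper cup from the placemat with the left gripper",
     "start_frame_id": 159, "end_frame_id": 227,
     "start_timestamp": "1736935774906000000",
     "end_timestamp": "1736935777206000000"},
    {"sequence_id": 2,
     "description": "toss the paper cup into the trash can with the left gripper",
     "start_frame_id": 231, "end_frame_id": 318,
     "start_timestamp": "1736935777306000000",
     "end_timestamp": "1736935780244000000"}
  ]
}
"""


def subtask_spans(annotation: dict) -> list:
    """Return (sequence_id, duration in seconds, frame count) per subtask."""
    spans = []
    for sub in sorted(annotation["subtasks"], key=lambda s: s["sequence_id"]):
        # Timestamps are stored as strings; convert before subtracting.
        duration_ns = int(sub["end_timestamp"]) - int(sub["start_timestamp"])
        frames = sub["end_frame_id"] - sub["start_frame_id"]
        spans.append((sub["sequence_id"], duration_ns / 1e9, frames))
    return spans


for seq, seconds, frames in subtask_spans(json.loads(ANNOTATION_JSON)):
    print(f"subtask {seq}: {seconds:.3f} s over {frames} frames")
```

Keeping the timestamps as integers until the final division avoids the precision loss that would come from parsing 19-digit values as floats.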

Teleoperation Robot Data Format

File Structure

f"{robot_name}_{date}_{timestamp}_{sequence_id}"
├── RM_AIDAL_250124_172033_0.mcap
├── RM_AIDAL_250124_172033_0.json
└── RM_AIDAL_250126_093648_0.metadata.yaml
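
The robot name in these file names can itself contain an underscore (`RM_AIDAL`), so splitting from the left would break it apart. A small sketch that splits from the right instead, assuming the last three fields never contain an underscore; `parse_teleop_stem` is a hypothetical helper name.

```python
def parse_teleop_stem(stem: str) -> dict:
    """Split '{robot_name}_{date}_{timestamp}_{sequence_id}' from the right.

    Assumption: date, timestamp and sequence_id contain no '_',
    while robot_name may (for example 'RM_AIDAL').
    """
    parts = stem.rsplit("_", 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {stem!r}")
    robot_name, date, timestamp, sequence_id = parts
    return {
        "robot_name": robot_name,
        "date": date,
        "timestamp": timestamp,
        "sequence_id": int(sequence_id),
    }


print(parse_teleop_stem("RM_AIDAL_250124_172033_0"))
```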

Multimodal Data

Files: RM_AIDAL_250126_091041_0.mcap
Bag size: 443.3 MiB
Storage id: mcap
Duration: 100.052164792s
Start: Jan 24 2025 21:37:32.526605552 (1737725852.526605552)
End: Jan 24 2025 21:39:12.578770344 (1737725952.578770344)
Messages: 62116
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_02/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_03/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: io_teleop/joint_states | Type: sensor_msgs/msg/JointState | Count: 1529 | Serialization Format: cdr
Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
Topic: io_teleop/target_ee_poses | Type: geometry_msgs/msg/PoseArray | Count: 10014 | Serialization Format: cdr
Topic: io_teleop/target_base_move | Type: std_msgs/msg/Float64MultiArray | Count: 10010 | Serialization Format: cdr
Topic: io_teleop/target_gripper_status | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: io_teleop/target_joint_from_vr | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr

| Topic Name | Description |
| --- | --- |
| /camera_01/color/image_raw/compressed | RGB image from main camera |
| /camera_02/color/image_raw/compressed | RGB image from left camera |
| /camera_03/color/image_raw/compressed | RGB image from right camera |
| io_teleop/joint_states | Joint states |
| io_teleop/joint_cmd | Joint commands |
| io_teleop/target_ee_poses | Target end-effector poses |
| io_teleop/target_base_move | Target base movement |
| io_teleop/target_gripper_status | Target gripper status |
| io_teleop/target_joint_from_vr | Joint targets from VR device |
| /robot_description | Robot URDF description |
| /tf | TF spatial pose transformation information |
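
Dividing the message counts by the roughly 100 s duration suggests that the topics are recorded at different rates: about 30 Hz for the cameras, about 100 Hz for the command topics, and about 15 Hz for `io_teleop/joint_states` and `/tf`. Anyone pairing an image with a command therefore has to pick a matching rule. The sketch below shows one common choice, nearest-timestamp matching. It is not necessarily the rule IO's own tools apply, and `nearest_indices` is a hypothetical helper name.

```python
from bisect import bisect_left


def nearest_indices(reference_ts: list, target_ts: list) -> list:
    """For each reference timestamp, return the index of the closest target timestamp.

    Both lists must be sorted in ascending order. Ties go to the earlier sample.
    """
    if not target_ts:
        raise ValueError("target_ts must not be empty")
    result = []
    for t in reference_ts:
        i = bisect_left(target_ts, t)
        if i == 0:
            result.append(0)                      # before the first target sample
        elif i == len(target_ts):
            result.append(len(target_ts) - 1)     # after the last target sample
        else:
            before, after = target_ts[i - 1], target_ts[i]
            result.append(i if after - t < t - before else i - 1)
    return result


# Synthetic timestamps in milliseconds: ~30 Hz camera frames against 100 Hz commands.
camera_ms = [0, 33, 67, 100]
command_ms = list(range(0, 101, 10))
print(nearest_indices(camera_ms, command_ms))  # [0, 3, 7, 10]
```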

Natural Language Annotation Data

{
"belong_to": "RM_AIDAL_250126_091041_0",
"mocap_offset": [],
"object_set": [
"lemon candy",
"plate",
"pistachios"
],
"scene": "250126",
"skill_set": [
"place {A} on {B}"
],
"subtasks": [
{
"skill": "place {A} on {B}",
"objecta": "lemon candy",
"objectb": "plate",
"options": [
"leftHand"
],
"description": "place the lemon candy on the plate with the left hand",
"end_timestamp": "1737725886915000000",
"sequence_id": 1,
"start_timestamp": "1737725880757000000",
"comment": "",
"attempts": "success"
},
{
"skill": "place {A} on {B}",
"objecta": "pistachios",
"objectb": "plate",
"options": [
"rightHand"
],
"description": "place the pistachios on the plate with the right hand",
"end_timestamp": "1737725950745000000",
"sequence_id": 2,
"start_timestamp": "1737725941657000000",
"comment": "",
"attempts": "success"
}
],
"tag_set": [],
"task_description": "20250205_RM_ItemPacking_zhouxw"
}
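
In this variant each subtask carries `objecta` and `objectb` alongside the skill template. A minimal sketch of filling the template and of selecting the messages that fall inside a subtask's time window follows. It assumes that `{A}` and `{B}` map to `objecta` and `objectb`, and that timestamps are nanoseconds on the same clock as the message log times. Neither is stated by the source. `render_skill` and `messages_in_subtask` are hypothetical helper names.

```python
# First subtask from the sample annotation above.
SUBTASK = {
    "skill": "place {A} on {B}",
    "objecta": "lemon candy",
    "objectb": "plate",
    "options": ["leftHand"],
    "start_timestamp": "1737725880757000000",
    "end_timestamp": "1737725886915000000",
}


def render_skill(subtask: dict) -> str:
    """Fill the {A}/{B} placeholders of the skill template."""
    return subtask["skill"].format(A=subtask["objecta"], B=subtask["objectb"])


def messages_in_subtask(messages: list, subtask: dict) -> list:
    """Keep (log_time_ns, payload) pairs inside the subtask window (bounds inclusive)."""
    start_ns = int(subtask["start_timestamp"])
    end_ns = int(subtask["end_timestamp"])
    return [(t, payload) for t, payload in messages if start_ns <= t <= end_ns]


print(render_skill(SUBTASK))  # place lemon candy on plate

# Synthetic log times around the window, for illustration only.
demo = [
    (1737725880000000000, "before"),
    (1737725881000000000, "inside"),
    (1737725886915000000, "at end"),
    (1737725887000000000, "after"),
]
print([payload for _, payload in messages_in_subtask(demo, SUBTASK)])
```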

Model Training Data

We provide tools that convert the MCAP and JSON data described above into formats that Python can parse directly, ready for use in large-model training.

HDF5 Format

Below is a basic data example. The actual training data format may vary depending on the original data and customer customization requirements:

/root
├── metadata (Group)
│ ├── creation_time (Attribute)
│ ├── source (Attribute)
│ ├── schema (Dataset)

├── messages (Group)
│ ├── /camera_01/color/image_raw/compressed (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── /camera_02/color/image_raw/compressed (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── /camera_03/color/image_raw/compressed (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/joint_states (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/joint_cmd (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/target_ee_poses (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/target_base_move (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/target_gripper_status (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── io_teleop/target_joint_from_vr (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── /robot_description (Group)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
│ │
│ ├── /tf (Group)
│ │ ├── timestamps (Dataset)
│ │ ├── data (Dataset)
│ │ ├── schema_id (Attribute)
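
HDF5 treats `/` as a path separator, so a topic such as `/camera_01/color/image_raw/compressed` would normally be stored as nested groups, not as a single group with slashes in its name. Whether the conversion tool nests or escapes the names is not stated here. The sketch below assumes nesting and walks a hierarchy to find every group that directly holds a `data` entry. It relies only on `.items()` and `in`, which `h5py` groups and plain dictionaries both support, so the demonstration runs on a dictionary stand-in. `find_topic_groups` is a hypothetical helper name.

```python
def find_topic_groups(node, path=""):
    """Yield the paths of groups that directly contain a 'data' entry.

    Works on any mapping-like tree: nested dicts or h5py.Group objects.
    """
    for name, child in node.items():
        child_path = f"{path}/{name}"
        if not hasattr(child, "items"):
            continue  # a dataset (leaf), not a group
        if "data" in child:
            yield child_path
        else:
            yield from find_topic_groups(child, child_path)


# Dictionary stand-in for part of the 'messages' group shown above.
messages = {
    "camera_01": {"color": {"image_raw": {"compressed": {"timestamps": [], "data": []}}}},
    "io_teleop": {"joint_states": {"timestamps": [], "data": []}},
    "robot_description": {"data": []},
    "tf": {"timestamps": [], "data": []},
}
for topic in find_topic_groups(messages):
    print(topic)

# With h5py installed, the same function can be applied to a real file
# (the file name below is a placeholder):
#   import h5py
#   with h5py.File("episode.h5", "r") as f:
#       print(list(find_topic_groups(f["messages"])))
```

Every yielded path starts with `/`, so topics recorded without a leading slash, such as `io_teleop/joint_states`, come back as `/io_teleop/joint_states`.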

LeRobot Format

You can refer to our sample dataset: https://huggingface.co/datasets/io-ai-data/DesktopCleanup_RM_AIDAL_demo

{
"codebase_version": "v2.1",
"robot_type": "custom_arm",
"total_episodes": 20,
"total_frames": 5134,
"total_tasks": 20,
"total_videos": 0,
"total_chunks": 1,
"chunks_size": 1000,
"fps": 30,
"splits": {
"train": "0:20"
},
"data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
"video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
"features": {
"observation.images.camera_01": {
"dtype": "image",
"shape": [
480,
640,
3
]
},
"observation.images.camera_02": {
"dtype": "image",
"shape": [
480,
640,
3
]
},
"observation.images.camera_03": {
"dtype": "image",
"shape": [
480,
640,
3
]
},
"observation.images.camera_04": {
"dtype": "image",
"shape": [
480,
640,
3
]
},
"observation.state": {
"dtype": "float64",
"shape": [
37
],
"names": [
"r_joint1",
"r_joint2",
"r_joint3",
"r_joint4",
"r_joint5",
"r_joint6",
"l_joint1",
"l_joint2",
"l_joint3",
"l_joint4",
"l_joint5",
"l_joint6",
"R_thumb_MCP_joint1",
"R_thumb_MCP_joint2",
"R_thumb_PIP_joint",
"R_thumb_DIP_joint",
"R_index_MCP_joint",
"R_index_DIP_joint",
"R_middle_MCP_joint",
"R_middle_DIP_joint",
"R_ring_MCP_joint",
"R_ring_DIP_joint",
"R_pinky_MCP_joint",
"R_pinky_DIP_joint",
"L_thumb_MCP_joint1",
"L_thumb_MCP_joint2",
"L_thumb_PIP_joint",
"L_thumb_DIP_joint",
"L_index_MCP_joint",
"L_index_DIP_joint",
"L_middle_MCP_joint",
"L_middle_DIP_joint",
"L_ring_MCP_joint",
"L_ring_DIP_joint",
"L_pinky_MCP_joint",
"L_pinky_DIP_joint",
"platform_joint"
]
},
"action": {
"dtype": "float64",
"shape": [
12
],
"names": [
"l_joint1",
"l_joint2",
"l_joint3",
"l_joint4",
"l_joint5",
"l_joint6",
"r_joint1",
"r_joint2",
"r_joint3",
"r_joint4",
"r_joint5",
"r_joint6"
]
},
"observation.gripper": {
"dtype": "float64",
"shape": [
2
],
"names": [
"right_gripper",
"left_gripper"
]
},
"timestamp": {
"dtype": "float32",
"shape": [
1
],
"names": null
},
"frame_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"episode_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"index": {
"dtype": "int64",
"shape": [
1
],
"names": null
},
"task_index": {
"dtype": "int64",
"shape": [
1
],
"names": null
}
}
}
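
Two details of this metadata are easy to miss: `data_path` is a format template, and `observation.state` lists the right arm first while `action` lists the left arm first. The sketch below resolves the path of one episode and builds an index map between the two joint orders. It assumes the chunk index is the episode index integer-divided by `chunks_size`, which matches the LeRobot v2.1 layout. `episode_data_path` and `reorder_indices` are hypothetical helper names.

```python
# Trimmed copy of the metadata above; only the fields used here are kept.
INFO = {
    "chunks_size": 1000,
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
}

# First 12 of the 37 observation.state names (the arm joints), in file order.
STATE_ARM_NAMES = [f"r_joint{i}" for i in range(1, 7)] + [f"l_joint{i}" for i in range(1, 7)]
# The 12 action names, in file order.
ACTION_NAMES = [f"l_joint{i}" for i in range(1, 7)] + [f"r_joint{i}" for i in range(1, 7)]


def episode_data_path(info: dict, episode_index: int) -> str:
    """Fill the data_path template for one episode."""
    episode_chunk = episode_index // info["chunks_size"]
    return info["data_path"].format(episode_chunk=episode_chunk, episode_index=episode_index)


def reorder_indices(source_names: list, target_names: list) -> list:
    """Indices into source_names that produce the order of target_names."""
    position = {name: i for i, name in enumerate(source_names)}
    return [position[name] for name in target_names]


print(episode_data_path(INFO, 7))  # data/chunk-000/episode_000007.parquet
print(reorder_indices(STATE_ARM_NAMES, ACTION_NAMES))
```

Looking joints up by name instead of by position keeps a state-versus-action comparison correct even though the two vectors order the arms differently.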