Data Formats
The IO Data Platform is designed for universal robot data management and uses the Robot Operating System (ROS) format as its unified data standard.
- Data Import: Automatically converts non-ROS data from collection systems such as Zhiyuan and Songling into the ROS standard format for unified management.
- Data Visualization: Built-in visualization models for 30+ mainstream robots, enabling smooth playback of 3D animations and planar images in all formats.
- Data Export: Supports one-click export to the standard HDF5 and LeRobot formats, with joint and image dimensions adapted automatically to the original data, ready for direct model training.
Human Data Format
Human data collection is primarily used to record operator actions and interaction processes, containing multimodal sensor data.
File Structure
Each collection task generates a folder named after its date, project, scene, task, staff ID, and timestamp:
f"{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}"
├── align_result.csv # Timestamp alignment table
├── annotation.json # Annotation data
├── config/ # Camera and sensor configuration
│ ├── calib_data.yml
│ ├── depth_to_rgb.yml
│ ├── mocap_main.yml
│ ├── orbbec_depth.yml
│ ├── orbbec_rgb.yml
│ └── pose_calib.yml
└── data.mcap # Multimodal data package
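The folder name can be split back into its fields. The following is a minimal sketch, assuming that no field contains an underscore (this holds for the sample name `20250115_InnerTest_PublicArea_TableClearing_szk_180926` that appears in the annotation example on this page). `HumanFolder` and `parse_human_folder` are hypothetical names for illustration, not part of the platform.

```python
from typing import NamedTuple


class HumanFolder(NamedTuple):
    """Fields of a human data collection folder name (hypothetical helper type)."""
    date: str
    project: str
    scene: str
    task: str
    staff_id: str
    timestamp: str


def parse_human_folder(name: str) -> HumanFolder:
    """Split a folder name into its six underscore-separated fields.

    Assumption: no individual field contains an underscore.
    """
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {name!r}")
    return HumanFolder(*parts)


info = parse_human_folder("20250115_InnerTest_PublicArea_TableClearing_szk_180926")
print(info.scene, info.task)  # PublicArea TableClearing
```

If a project or task name can itself contain underscores, this simple split is ambiguous and the fields would need a different delimiter or a lookup table.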
Multimodal Data
The data.mcap file contains synchronized data from all sensors, stored in MCAP format.
Main Topic List:
Topic Name | Data Type | Description |
---|---|---|
/mocap/sensor_data | io_msgs/squashed_mocap_data | Motion capture joint velocity, acceleration, angular velocity, rotation angle, and sensor data |
/mocap/ros_tf | tf2_msgs/TFMessage | TF transformations for all joints based on motion capture |
/joint_states | sensor_msgs/JointState | JointState for all joints based on motion capture |
/rgbd/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head camera |
/rgbd/depth/image_raw | sensor_msgs/Image | Depth image from main head camera |
/colorized_depth | sensor_msgs/CompressedImage | Colorized depth image from main head camera |
/left_ee_pose | geometry_msgs/PoseStamped | Left gripper pose in main head camera coordinate system |
/right_ee_pose | geometry_msgs/PoseStamped | Right gripper pose in main head camera coordinate system |
/claws_l_hand | io_msgs/claws_angle | Left gripper closure degree |
/claws_r_hand | io_msgs/claws_angle | Right gripper closure degree |
/claws_touch_data | io_msgs/squashed_touch | Gripper tactile data |
/realsense_left_hand/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from left gripper camera |
/realsense_left_hand/depth/image_rect_raw | sensor_msgs/Image | Depth image from left gripper camera |
/realsense_right_hand/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from right gripper camera |
/realsense_right_hand/depth/image_rect_raw | sensor_msgs/Image | Depth image from right gripper camera |
/usb_cam_fisheye/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head fisheye camera |
/usb_cam_left/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head left monocular camera |
/usb_cam_right/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head right monocular camera |
/ee_visualization | sensor_msgs/CompressedImage | End-effector pose visualization in main head camera RGB image |
/touch_visualization | sensor_msgs/CompressedImage | Gripper tactile data visualization |
/robot_description | std_msgs/String | Motion capture URDF |
/global_localization | geometry_msgs/PoseStamped | Main head camera pose in world coordinate system |
/world_left_ee_pose | geometry_msgs/PoseStamped | Left gripper pose in world coordinate system |
/world_right_ee_pose | geometry_msgs/PoseStamped | Right gripper pose in world coordinate system |
Camera Data:
- Main head RGBD camera: Color + depth images
- Left/Right gripper cameras: RealSense RGBD
- Fisheye camera: Panoramic view
- Left/Right monocular cameras: Stereo vision
Note: If tactile gloves are used, an additional /mocap/touch_data topic will be added.
Original MCAP data format:
library: mcap go v1.7.0
profile: ros1
messages: 45200
duration: 1m5.625866496s
start: 2025-01-15T18:09:29.628202496+08:00 (1736935769.628202496)
end: 2025-01-15T18:10:35.254068992+08:00 (1736935835.254068992)
compression:
zstd: [764/764 chunks] [6.13 GiB/3.84 GiB (37.39%)] [59.87 MiB/sec]
channels:
(1) /rgbd/color/image_raw/compressed 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(2) /joint_states 1970 msgs (30.02 Hz) : sensor_msgs/JointState [ros1msg]
(3) /claws_r_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(4) /global_localization 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(5) /robot_description 1 msgs : std_msgs/String [ros1msg]
(6) /ee_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(7) /rgbd/depth/image_raw 1970 msgs (30.02 Hz) : sensor_msgs/Image [ros1msg]
(8) /colorized_depth 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(9) /claws_l_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(10) /claws_touch_data 1970 msgs (30.02 Hz) : io_msgs/squashed_touch [ros1msg]
(11) /touch_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(12) /mocap/sensor_data 1970 msgs (30.02 Hz) : io_msgs/squashed_mocap_data [ros1msg]
(13) /mocap/ros_tf 1970 msgs (30.02 Hz) : tf2_msgs/TFMessage [ros1msg]
(14) /left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(15) /right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(16) /usb_cam_left/mjpeg_raw/compressed 1960 msgs (29.87 Hz) : sensor_msgs/CompressedImage [ros1msg]
(17) /usb_cam_right/mjpeg_raw/compressed 1946 msgs (29.65 Hz) : sensor_msgs/CompressedImage [ros1msg]
(18) /usb_cam_fisheye/mjpeg_raw/compressed 1957 msgs (29.82 Hz) : sensor_msgs/CompressedImage [ros1msg]
(19) /realsense_left_hand/depth/image_rect_raw 1961 msgs (29.88 Hz) : sensor_msgs/Image [ros1msg]
(20) /realsense_left_hand/color/image_raw/compressed 1961 msgs (29.88 Hz) : sensor_msgs/CompressedImage [ros1msg]
(21) /realsense_right_hand/depth/image_rect_raw 1947 msgs (29.67 Hz) : sensor_msgs/Image [ros1msg]
(22) /realsense_right_hand/color/image_raw/compressed 1947 msgs (29.67 Hz) : sensor_msgs/CompressedImage [ros1msg]
(23) /world_left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(24) /world_right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
channels: 24
attachments: 0
metadata: 0
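The per-channel rates in the summary above are consistent with dividing each message count by the recording duration. The sketch below reproduces three of them; treating the rate as a plain count/duration average is an assumption about how the figures were produced, and `message_rate_hz` is a hypothetical helper.

```python
def message_rate_hz(message_count: int, duration_s: float) -> float:
    """Average publish rate over the whole recording (count / duration)."""
    return message_count / duration_s


# "duration: 1m5.625866496s" from the summary, converted to seconds.
duration_s = 65.625866496

# Message counts copied from the channel list.
counts = {
    "/joint_states": 1970,
    "/usb_cam_left/mjpeg_raw/compressed": 1960,
    "/realsense_right_hand/depth/image_rect_raw": 1947,
}

for topic, count in counts.items():
    print(f"{topic}: {message_rate_hz(count, duration_s):.2f} Hz")
```

The slightly lower rates on the USB and RealSense topics correspond to their lower message counts, which is a quick way to spot cameras that produced fewer frames during a recording.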
Natural Language Annotation
The annotation.json file contains semantic annotation information for tasks, used for training and understanding task intent.
Main Field Descriptions:
Field | Type | Description |
---|---|---|
belong_to | string | Associated data file identifier |
object_set | array | All objects involved in the task |
scene | string | Scene identifier |
skill_set | array | Skill template collection |
subtasks | array | Subtask sequence |
task_description | string | Task description |
Skill Template Format:
- pick {A} from {B}: Pick A from B
- place {A} on {B}: Place A on B
- toss {A} into {B}: Toss A into B
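Because the placeholders use the `{A}`/`{B}` form, a template can be filled with Python's built-in `str.format`. This is a minimal sketch; `fill_skill` is a hypothetical helper, and the platform's own descriptions may add articles or a gripper suffix that this simple substitution does not produce.

```python
def fill_skill(template: str, object_a: str, object_b: str) -> str:
    """Substitute the {A} and {B} placeholders of a skill template."""
    return template.format(A=object_a, B=object_b)


print(fill_skill("pick {A} from {B}", "paper cup", "placemat"))
# pick paper cup from placemat
print(fill_skill("toss {A} into {B}", "paper cup", "trash can"))
# toss paper cup into trash can
```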
Subtask Structure:
{
"skill": "pick {A} from {B}",
"description": "pick the paper cup from the placemat with the left gripper",
"description_zh": "左夹爪 从 餐垫 捡起 纸杯",
"start_frame_id": 159,
"end_frame_id": 227,
"start_timestamp": "1736935774906000000",
"end_timestamp": "1736935777206000000",
"sequence_id": 1,
"attempts": "success",
"comment": ""
}
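The timestamps are stored as strings, and their magnitude suggests nanoseconds since the Unix epoch (an inference from the sample values, not a documented guarantee). Under that assumption, the duration and frame span of a subtask can be computed as in the sketch below; `subtask_duration_s` is a hypothetical helper.

```python
import json

# A reduced copy of the subtask above, keeping only the fields used here.
subtask = json.loads("""
{
  "skill": "pick {A} from {B}",
  "start_frame_id": 159,
  "end_frame_id": 227,
  "start_timestamp": "1736935774906000000",
  "end_timestamp": "1736935777206000000"
}
""")


def subtask_duration_s(st: dict) -> float:
    """Duration in seconds, assuming nanosecond timestamps stored as strings."""
    # Convert with int() before subtracting so no precision is lost;
    # float() alone cannot represent 19-digit integers exactly.
    delta_ns = int(st["end_timestamp"]) - int(st["start_timestamp"])
    return delta_ns / 1e9


duration = subtask_duration_s(subtask)
frames = subtask["end_frame_id"] - subtask["start_frame_id"]
print(duration, frames)  # 2.3 68
```

A span of 68 frames over 2.3 s is roughly 30 frames per second, which matches the camera rate reported for the sample recording.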
Complete annotation data example:
{
"belong_to": "20250115_InnerTest_PublicArea_TableClearing_szk_180926",
"mocap_offset": [],
"object_set": [
"paper cup",
"placemat",
"trash can",
"napkin",
"plate",
"dinner knife",
"tableware storage box",
"wine glass",
"dinner fork"
],
"scene": "PublicArea",
"skill_set": [
"pick {A} from {B}",
"toss {A} into {B}",
"place {A} on {B}"
],
"subtasks": [
{
"skill": "pick {A} from {B}",
"description": "pick the paper cup from the placemat with the left gripper",
"description_zh": "左夹爪 从 餐垫 捡起 纸杯",
"end_frame_id": 227,
"end_timestamp": "1736935777206000000",
"sequence_id": 1,
"start_frame_id": 159,
"start_timestamp": "1736935774906000000",
"comment": "",
"attempts": "success"
},
{
"skill": "toss {A} into {B}",
"description": "toss the paper cup into the trash can with the left gripper",
"description_zh": "左夹爪 扔纸杯进垃圾桶",
"end_frame_id": 318,
"end_timestamp": "1736935780244000000",
"sequence_id": 2,
"start_frame_id": 231,
"start_timestamp": "1736935777306000000",
"comment": "",
"attempts": "success"
}
],
"tag_set": [],
"task_description": "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
}
Teleoperation Robot Data Format
Teleoperation robot data records the process of operators controlling robots through VR devices.
File Structure
f"{robot_name}_{date}_{timestamp}_{sequence_id}"
├── RM_AIDAL_250124_172033_0.mcap # Multimodal data
├── RM_AIDAL_250124_172033_0.json # Annotation data
└── RM_AIDAL_250126_093648_0.metadata.yaml # Metadata
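In the sample file names the robot name (`RM_AIDAL`) itself contains an underscore, so splitting from the left is ambiguous. A minimal sketch that splits from the right instead, assuming the last three fields never contain underscores; `parse_teleop_name` is a hypothetical helper.

```python
def parse_teleop_name(stem: str) -> dict:
    """Parse "{robot_name}_{date}_{timestamp}_{sequence_id}" from the right.

    Assumption: date, timestamp, and sequence_id contain no underscores,
    while robot_name may (for example "RM_AIDAL").
    """
    robot_name, date, timestamp, sequence_id = stem.rsplit("_", 3)
    return {
        "robot_name": robot_name,
        "date": date,
        "timestamp": timestamp,
        "sequence_id": int(sequence_id),
    }


parsed = parse_teleop_name("RM_AIDAL_250124_172033_0")
print(parsed["robot_name"], parsed["sequence_id"])  # RM_AIDAL 0
```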
Multimodal Data
Main Topic List:
Topic Name | Data Type | Description |
---|---|---|
/camera_01/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from main camera |
/camera_02/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from left camera |
/camera_03/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from right camera |
io_teleop/joint_states | sensor_msgs/msg/JointState | Joint states |
io_teleop/joint_cmd | sensor_msgs/msg/JointState | Joint commands |
io_teleop/target_ee_poses | geometry_msgs/msg/PoseArray | Target end-effector poses |
io_teleop/target_base_move | std_msgs/msg/Float64MultiArray | Target base movement |
io_teleop/target_gripper_status | sensor_msgs/msg/JointState | Target gripper status |
io_teleop/target_joint_from_vr | sensor_msgs/msg/JointState | Joint targets from VR device |
/robot_description | std_msgs/msg/String | Robot URDF description |
/tf | tf2_msgs/msg/TFMessage | TF spatial pose transformation information |
Original MCAP data format:
Files: RM_AIDAL_250126_091041_0.mcap
Bag size: 443.3 MiB
Storage id: mcap
Duration: 100.052164792s
Start: Jan 24 2025 21:37:32.526605552 (1737725852.526605552)
End: Jan 24 2025 21:39:12.578770344 (1737725952.578770344)
Messages: 62116
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_02/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_03/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: io_teleop/joint_states | Type: sensor_msgs/msg/JointState | Count: 1529 | Serialization Format: cdr
Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
Topic: io_teleop/target_ee_poses | Type: geometry_msgs/msg/PoseArray | Count: 10014 | Serialization Format: cdr
Topic: io_teleop/target_base_move | Type: std_msgs/msg/Float64MultiArray | Count: 10010 | Serialization Format: cdr
Topic: io_teleop/target_gripper_status | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: io_teleop/target_joint_from_vr | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr
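A summary in this textual layout can be turned into per-topic counts with a regular expression, which makes it easy to compare sensor and command rates. This is a sketch that parses the text shown above, not an official API; `TOPIC_LINE` and `parse_topic_counts` are hypothetical names, and only three of the topic lines are repeated in the sample.

```python
import re

# Matches the "Topic: ... | Type: ... | Count: ..." part of each summary line.
TOPIC_LINE = re.compile(
    r"Topic: (?P<topic>\S+) \| Type: (?P<type>\S+) \| Count: (?P<count>\d+)"
)


def parse_topic_counts(text: str) -> dict[str, int]:
    """Map each topic name to its message count."""
    return {m["topic"]: int(m["count"]) for m in TOPIC_LINE.finditer(text)}


sample = """\
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr
"""

duration_s = 100.052164792  # "Duration: 100.052164792s"
counts = parse_topic_counts(sample)
for topic, count in counts.items():
    print(f"{topic}: {count} msgs, {count / duration_s:.2f} Hz")
```

For this recording the average rates work out to about 30 Hz for the cameras, about 100 Hz for the teleoperation commands, and about 15 Hz for `/tf`, so the streams need to be resampled or aligned before they are combined into training frames.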
Natural Language Annotation
Teleoperation data uses the same annotation format as human data: natural language descriptions of the actions the robot or human performed and the objects involved. In the example below, each subtask additionally carries objecta, objectb, and options fields and omits the frame IDs.
Complete teleoperation annotation data example:
{
"belong_to": "RM_AIDAL_250126_091041_0",
"mocap_offset": [],
"object_set": [
"lemon candy",
"plate",
"pistachios"
],
"scene": "250126",
"skill_set": [
"place {A} on {B}"
],
"subtasks": [
{
"skill": "place {A} on {B}",
"objecta": "lemon candy",
"objectb": "plate",
"options": [
"leftHand"
],
"description": "place the lemon candy on the plate with the left hand",
"end_timestamp": "1737725886915000000",
"sequence_id": 1,
"start_timestamp": "1737725880757000000",
"comment": "",
"attempts": "success"
},
{
"skill": "place {A} on {B}",
"objecta": "pistachios",
"objectb": "plate",
"options": [
"rightHand"
],
"description": "place the pistachios on the plate with the right hand",
"end_timestamp": "1737725950745000000",
"sequence_id": 2,
"start_timestamp": "1737725941657000000",
"comment": "",
"attempts": "success"
}
],
"tag_set": [],
"task_description": "20250205_RM_ItemPacking_zhouxw"
}
Export Model Training Data
To facilitate model training, the platform can export the original MCAP and JSON data in formats suited to machine learning.
The common HDF5 and LeRobot formats can be exported with one click, and the export adapts automatically to different robots and sensor counts without manual configuration.
HDF5 Format
HDF5 format is suitable for large-scale data storage and fast access, using a hierarchical structure to organize data.
File Structure:
chunk_001.hdf5
├── /data/ # Data group
│ ├── episode_001/ # First task sequence
│ │ ├── action # Joint commands (multi-dimensional array)
│ │ ├── observation.state # Sensor observations
│ │ ├── observation.gripper # Gripper status
│ │ └── observation.images.* # Multi-view images
│ └── episode_002/ # Second task sequence
└── /meta/ # Metadata group
Data Content:
- action: Joint control commands (float32 array)
- observation.state: Sensor observations (float32 array)
- observation.images.*: Compressed image data (JPEG format)
- observation.gripper: Gripper status (float32 array)
- task: English natural language description
- task_zh: Chinese natural language description
- score: Action quality score
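Before training, it can help to confirm that an exported episode contains the datasets listed in the file structure. The sketch below works on any collection of dataset names, so it does not depend on a particular HDF5 library. `check_episode` is a hypothetical helper, and treating `observation.gripper` as mandatory is an assumption, since robots without grippers may not export it. The `task`, `task_zh`, and `score` entries are not checked because this page does not say where they are stored.

```python
from typing import Iterable

# Datasets expected in every episode group (assumption based on the tree above).
REQUIRED = ("action", "observation.state", "observation.gripper")
IMAGE_PREFIX = "observation.images."


def check_episode(names: Iterable[str]) -> list[str]:
    """Return a list of problems found in one episode group (empty if none)."""
    present = set(names)
    problems = [f"missing dataset: {name}" for name in REQUIRED if name not in present]
    if not any(name.startswith(IMAGE_PREFIX) for name in present):
        problems.append("no observation.images.* dataset")
    return problems


episode = ["action", "observation.state", "observation.images.cam_high"]
print(check_episode(episode))  # ['missing dataset: observation.gripper']
```

With the third-party h5py package, the names of one episode could be obtained from `h5py.File("chunk_001.hdf5", "r")["/data/episode_001"].keys()` and passed to the same function.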
LeRobot Format
LeRobot is a widely used dataset format in robot learning, compatible with mainstream robot learning frameworks.
Reference Sample Data: https://huggingface.co/datasets/io-ai-data/uncap_pen
Data Feature Definitions:
The feature count and shapes of an exported LeRobot dataset adapt automatically to the source data, supporting any number of cameras or joints. The shapes shown here are those of the export for the Songling desktop dual-arm robot with 7 joints per arm:
Feature Name | Data Type | Shape | Description |
---|---|---|---|
action | float32 | [14] | Joint commands (7 joints each for left and right arms) |
observation.state | float32 | [14] | Joint states (7 joints each for left and right arms) |
observation.images.cam_high | image | [3,480,640] | High camera image |
observation.images.cam_low | image | [3,480,640] | Low camera image |
observation.images.cam_left_wrist | image | [3,480,640] | Left wrist camera image |
observation.images.cam_right_wrist | image | [3,480,640] | Right wrist camera image |
timestamp | float32 | [1] | Timestamp |
frame_index | int64 | [1] | Frame index |
episode_index | int64 | [1] | Task sequence index |
Complete LeRobot format definition example:
{
"codebase_version": "v2.1",
"robot_type": "aloha",
"total_episodes": 10,
"total_frames": 3000,
"total_tasks": 1,
"total_videos": 0,
"total_chunks": 1,
"chunks_size": 1000,
"fps": 15,
"splits": {
"train": "0:10"
},
"data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
"video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
"features": {
"observation.state": {
"dtype": "float32",
"shape": [14],
"names": [
[
"right_waist",
"right_shoulder",
"right_elbow",
"right_forearm_roll",
"right_wrist_angle",
"right_wrist_rotate",
"right_gripper",
"left_waist",
"left_shoulder",
"left_elbow",
"left_forearm_roll",
"left_wrist_angle",
"left_wrist_rotate",
"left_gripper"
]
]
},
"action": {
"dtype": "float32",
"shape": [14],
"names": [
[
"right_waist",
"right_shoulder",
"right_elbow",
"right_forearm_roll",
"right_wrist_angle",
"right_wrist_rotate",
"right_gripper",
"left_waist",
"left_shoulder",
"left_elbow",
"left_forearm_roll",
"left_wrist_angle",
"left_wrist_rotate",
"left_gripper"
]
]
},
"observation.images.cam_high": {
"dtype": "image",
"shape": [3, 480, 640],
"names": ["channels", "height", "width"]
},
"observation.images.cam_low": {
"dtype": "image",
"shape": [3, 480, 640],
"names": ["channels", "height", "width"]
},
"observation.images.cam_left_wrist": {
"dtype": "image",
"shape": [3, 480, 640],
"names": ["channels", "height", "width"]
},
"observation.images.cam_right_wrist": {
"dtype": "image",
"shape": [3, 480, 640],
"names": ["channels", "height", "width"]
},
"timestamp": {
"dtype": "float32",
"shape": [1],
"names": null
},
"frame_index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"episode_index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"index": {
"dtype": "int64",
"shape": [1],
"names": null
},
"task_index": {
"dtype": "int64",
"shape": [1],
"names": null
}
}
}
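The `data_path` entry is a Python format string, so the location of an episode file can be derived from the metadata. The sketch below uses values copied from the definition above. It assumes that an episode's chunk is its index divided by `chunks_size`, which matches the naming pattern but is not stated on this page; `episode_data_path` is a hypothetical helper.

```python
# Values copied from the format definition above.
info = {
    "total_episodes": 10,
    "total_frames": 3000,
    "chunks_size": 1000,
    "fps": 15,
    "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
}


def episode_data_path(info: dict, episode_index: int) -> str:
    """Resolve the parquet path of one episode from the metadata."""
    # Assumption: episodes are grouped into chunks of `chunks_size` episodes.
    episode_chunk = episode_index // info["chunks_size"]
    return info["data_path"].format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )


print(episode_data_path(info, 7))  # data/chunk-000/episode_000007.parquet

# Average episode length in seconds: frames per episode divided by fps.
avg_seconds = info["total_frames"] / info["total_episodes"] / info["fps"]
print(avg_seconds)  # 20.0
```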