Data Formats

The IO Data Platform is designed for general-purpose robot data management and uses the Robot Operating System (ROS) data format as its common standard.

  1. Data Import: Automatically converts non-ROS data from collection systems such as Zhiyuan and Songling into the ROS standard format for unified management.
  2. Data Visualization: Provides built-in visualization models for 30+ mainstream robots, with smooth playback of 3D animations and 2D images in all formats.
  3. Data Export: Exports standard HDF5 and LeRobot data with one click. Joints and images are adapted automatically to the original data, so the output is ready for model training.

Human Data Format

Human data collection records operator actions and interactions as multimodal sensor data.

File Structure

Each collection task generates one folder, named after the following pattern and ending in a timestamp:

f"{date}_{project}_{scene}_{task}_{staff_id}_{timestamp}"
├── align_result.csv # Timestamp alignment table
├── annotation.json # Annotation data
├── config/ # Camera and sensor configuration
│ ├── calib_data.yml
│ ├── depth_to_rgb.yml
│ ├── mocap_main.yml
│ ├── orbbec_depth.yml
│ ├── orbbec_rgb.yml
│ └── pose_calib.yml
└── data.mcap # Multimodal data package
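
The naming pattern above can be split back into its fields. The following is a minimal sketch, assuming that no individual field contains an underscore; the class and function names are hypothetical, and the sample name is the belong_to value from the annotation example later in this document.

```python
from typing import NamedTuple


class HumanRecordingName(NamedTuple):
    date: str
    project: str
    scene: str
    task: str
    staff_id: str
    timestamp: str


def parse_human_folder_name(name: str) -> HumanRecordingName:
    """Split a folder name into its six underscore-separated fields.

    Assumption: no field contains an underscore itself.
    """
    parts = name.split("_")
    if len(parts) != 6:
        raise ValueError(f"expected 6 fields, got {len(parts)}: {name!r}")
    return HumanRecordingName(*parts)


info = parse_human_folder_name(
    "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
)
print(info.scene, info.task)  # PublicArea TableClearing
```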

Multimodal Data

The data.mcap file contains synchronized data from all sensors, stored in MCAP format.

Main Topic List:

Topic Name | Data Type | Description
/mocap/sensor_data | io_msgs/squashed_mocap_data | Motion capture joint velocity, acceleration, angular velocity, rotation angle, and sensor data
/mocap/ros_tf | tf2_msgs/TFMessage | TF transformations for all joints based on motion capture
/joint_states | sensor_msgs/JointState | JointState for all joints based on motion capture
/rgbd/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head camera
/rgbd/depth/image_raw | sensor_msgs/Image | Depth image from main head camera
/colorized_depth | sensor_msgs/CompressedImage | Colorized depth image from main head camera
/left_ee_pose | geometry_msgs/PoseStamped | Left gripper pose in main head camera coordinate system
/right_ee_pose | geometry_msgs/PoseStamped | Right gripper pose in main head camera coordinate system
/claws_l_hand | io_msgs/claws_angle | Left gripper closure degree
/claws_r_hand | io_msgs/claws_angle | Right gripper closure degree
/claws_touch_data | io_msgs/squashed_touch | Gripper tactile data
/realsense_left_hand/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from left gripper camera
/realsense_left_hand/depth/image_rect_raw | sensor_msgs/Image | Depth image from left gripper camera
/realsense_right_hand/color/image_raw/compressed | sensor_msgs/CompressedImage | RGB image from right gripper camera
/realsense_right_hand/depth/image_rect_raw | sensor_msgs/Image | Depth image from right gripper camera
/usb_cam_fisheye/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head fisheye camera
/usb_cam_left/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head left monocular camera
/usb_cam_right/mjpeg_raw/compressed | sensor_msgs/CompressedImage | RGB image from main head right monocular camera
/ee_visualization | sensor_msgs/CompressedImage | End-effector pose visualization in main head camera RGB image
/touch_visualization | sensor_msgs/CompressedImage | Gripper tactile data visualization
/robot_description | std_msgs/String | Motion capture URDF
/global_localization | geometry_msgs/PoseStamped | Main head camera pose in world coordinate system
/world_left_ee_pose | geometry_msgs/PoseStamped | Left gripper pose in world coordinate system
/world_right_ee_pose | geometry_msgs/PoseStamped | Right gripper pose in world coordinate system

Camera Data:

  • Main head RGBD camera: Color + depth images
  • Left/Right gripper cameras: RealSense RGBD
  • Fisheye camera: Panoramic view
  • Left/Right monocular cameras: Stereo vision

Note: If tactile gloves are used, an additional /mocap/touch_data topic will be added.

Original MCAP data format:
library: mcap go v1.7.0
profile: ros1
messages: 45200
duration: 1m5.625866496s
start: 2025-01-15T18:09:29.628202496+08:00 (1736935769.628202496)
end: 2025-01-15T18:10:35.254068992+08:00 (1736935835.254068992)
compression:
zstd: [764/764 chunks] [6.13 GiB/3.84 GiB (37.39%)] [59.87 MiB/sec]
channels:
(1) /rgbd/color/image_raw/compressed 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(2) /joint_states 1970 msgs (30.02 Hz) : sensor_msgs/JointState [ros1msg]
(3) /claws_r_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(4) /global_localization 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(5) /robot_description 1 msgs : std_msgs/String [ros1msg]
(6) /ee_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(7) /rgbd/depth/image_raw 1970 msgs (30.02 Hz) : sensor_msgs/Image [ros1msg]
(8) /colorized_depth 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(9) /claws_l_hand 1970 msgs (30.02 Hz) : io_msgs/claws_angle [ros1msg]
(10) /claws_touch_data 1970 msgs (30.02 Hz) : io_msgs/squashed_touch [ros1msg]
(11) /touch_visualization 1970 msgs (30.02 Hz) : sensor_msgs/CompressedImage [ros1msg]
(12) /mocap/sensor_data 1970 msgs (30.02 Hz) : io_msgs/squashed_mocap_data [ros1msg]
(13) /mocap/ros_tf 1970 msgs (30.02 Hz) : tf2_msgs/TFMessage [ros1msg]
(14) /left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(15) /right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(16) /usb_cam_left/mjpeg_raw/compressed 1960 msgs (29.87 Hz) : sensor_msgs/CompressedImage [ros1msg]
(17) /usb_cam_right/mjpeg_raw/compressed 1946 msgs (29.65 Hz) : sensor_msgs/CompressedImage [ros1msg]
(18) /usb_cam_fisheye/mjpeg_raw/compressed 1957 msgs (29.82 Hz) : sensor_msgs/CompressedImage [ros1msg]
(19) /realsense_left_hand/depth/image_rect_raw 1961 msgs (29.88 Hz) : sensor_msgs/Image [ros1msg]
(20) /realsense_left_hand/color/image_raw/compressed 1961 msgs (29.88 Hz) : sensor_msgs/CompressedImage [ros1msg]
(21) /realsense_right_hand/depth/image_rect_raw 1947 msgs (29.67 Hz) : sensor_msgs/Image [ros1msg]
(22) /realsense_right_hand/color/image_raw/compressed 1947 msgs (29.67 Hz) : sensor_msgs/CompressedImage [ros1msg]
(23) /world_left_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
(24) /world_right_ee_pose 1970 msgs (30.02 Hz) : geometry_msgs/PoseStamped [ros1msg]
channels: 24
attachments: 0
metadata: 0
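
The per-channel rates in the summary above appear to be the message count divided by the recording duration. This is inferred from the numbers themselves, not from the tool's documentation. A minimal sketch that reproduces three of the listed rates (the function name is hypothetical):

```python
def channel_rate_hz(message_count: int, duration_s: float) -> float:
    """Average message rate over the whole recording."""
    return message_count / duration_s


# "duration: 1m5.625866496s" from the summary above, in seconds.
DURATION_S = 65.625866496

# Message counts copied from the channel list above.
counts = {
    "/joint_states": 1970,
    "/usb_cam_left/mjpeg_raw/compressed": 1960,
    "/realsense_right_hand/color/image_raw/compressed": 1947,
}

rates = {
    topic: round(channel_rate_hz(n, DURATION_S), 2)
    for topic, n in counts.items()
}
for topic, rate in rates.items():
    print(f"{topic}: {rate} Hz")
```

Channels with fewer messages than the 1970 of the main topics, such as the USB and RealSense cameras, therefore show rates slightly below 30 Hz.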

Natural Language Annotation

The annotation.json file contains semantic annotation information for tasks, used for training and understanding task intent.

Main Field Descriptions:

Field | Type | Description
belong_to | string | Associated data file identifier
object_set | array | All objects involved in the task
scene | string | Scene identifier
skill_set | array | Skill template collection
subtasks | array | Subtask sequence
task_description | string | Task description

Skill Template Format:

  • pick {A} from {B} - Pick A from B
  • place {A} on {B} - Place A on B
  • toss {A} into {B} - Toss A into B
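
Because {A} and {B} are valid Python format fields, a template can be filled with str.format. The sketch below is illustrative only: the helper name is hypothetical, and the platform's own annotation tooling may build descriptions differently (the sample descriptions in this document also append which gripper was used).

```python
def fill_skill(template: str, a: str, b: str) -> str:
    """Substitute the {A} and {B} placeholders of a skill template."""
    return template.format(A=a, B=b)


sentence = fill_skill("pick {A} from {B}", "the paper cup", "the placemat")
print(sentence)  # pick the paper cup from the placemat
```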

Subtask Structure:

{
  "skill": "pick {A} from {B}",
  "description": "pick the paper cup from the placemat with the left gripper",
  "description_zh": "左夹爪 从 餐垫 捡起 纸杯",
  "start_frame_id": 159,
  "end_frame_id": 227,
  "start_timestamp": "1736935774906000000",
  "end_timestamp": "1736935777206000000",
  "sequence_id": 1,
  "attempts": "success",
  "comment": ""
}
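
The timestamps in the subtask above look like Unix epoch nanoseconds stored as strings; that reading is an assumption based on their magnitude and on the recording start time shown earlier. Under that assumption, a subtask's duration and frame span can be computed as follows (the function name is hypothetical):

```python
from typing import Tuple

NS_PER_S = 1_000_000_000


def subtask_span(subtask: dict) -> Tuple[float, int]:
    """Return (duration in seconds, number of frames spanned).

    Assumption: timestamps are epoch nanoseconds encoded as strings.
    """
    start_ns = int(subtask["start_timestamp"])
    end_ns = int(subtask["end_timestamp"])
    frames = subtask["end_frame_id"] - subtask["start_frame_id"]
    return (end_ns - start_ns) / NS_PER_S, frames


subtask = {
    "start_frame_id": 159,
    "end_frame_id": 227,
    "start_timestamp": "1736935774906000000",
    "end_timestamp": "1736935777206000000",
}
duration_s, frames = subtask_span(subtask)
print(duration_s, frames)  # 2.3 68
```

Here 68 frames over 2.3 s is about 29.6 frames per second, consistent with the roughly 30 Hz topic rates of the recording.
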
Complete annotation data example:
{
  "belong_to": "20250115_InnerTest_PublicArea_TableClearing_szk_180926",
  "mocap_offset": [],
  "object_set": [
    "paper cup",
    "placemat",
    "trash can",
    "napkin",
    "plate",
    "dinner knife",
    "tableware storage box",
    "wine glass",
    "dinner fork"
  ],
  "scene": "PublicArea",
  "skill_set": [
    "pick {A} from {B}",
    "toss {A} into {B}",
    "place {A} on {B}"
  ],
  "subtasks": [
    {
      "skill": "pick {A} from {B}",
      "description": "pick the paper cup from the placemat with the left gripper",
      "description_zh": "左夹爪 从 餐垫 捡起 纸杯",
      "end_frame_id": 227,
      "end_timestamp": "1736935777206000000",
      "sequence_id": 1,
      "start_frame_id": 159,
      "start_timestamp": "1736935774906000000",
      "comment": "",
      "attempts": "success"
    },
    {
      "skill": "toss {A} into {B}",
      "description": "toss the paper cup into the trash can with the left gripper",
      "description_zh": "左夹爪 扔纸杯进垃圾桶",
      "end_frame_id": 318,
      "end_timestamp": "1736935780244000000",
      "sequence_id": 2,
      "start_frame_id": 231,
      "start_timestamp": "1736935777306000000",
      "comment": "",
      "attempts": "success"
    }
  ],
  "tag_set": [],
  "task_description": "20250115_InnerTest_PublicArea_TableClearing_szk_180926"
}
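
Before training on annotations like the one above, a few consistency checks can catch malformed files. The rules in this sketch (skill listed in skill_set, end after start, no overlap between consecutive subtasks) are assumptions drawn from the example, not constraints documented by the platform; the function name is hypothetical.

```python
from typing import List


def check_annotation(annotation: dict) -> List[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    skills = set(annotation.get("skill_set", []))
    previous_end = None
    subtasks = sorted(
        annotation.get("subtasks", []), key=lambda s: s["sequence_id"]
    )
    for sub in subtasks:
        label = f"subtask {sub['sequence_id']}"
        if sub["skill"] not in skills:
            problems.append(f"{label}: skill not listed in skill_set")
        start = int(sub["start_timestamp"])
        end = int(sub["end_timestamp"])
        if end <= start:
            problems.append(f"{label}: end_timestamp is not after start")
        if previous_end is not None and start < previous_end:
            problems.append(f"{label}: overlaps the previous subtask")
        previous_end = end
    return problems


annotation = {
    "skill_set": ["pick {A} from {B}", "toss {A} into {B}"],
    "subtasks": [
        {"skill": "pick {A} from {B}", "sequence_id": 1,
         "start_timestamp": "1736935774906000000",
         "end_timestamp": "1736935777206000000"},
        {"skill": "toss {A} into {B}", "sequence_id": 2,
         "start_timestamp": "1736935777306000000",
         "end_timestamp": "1736935780244000000"},
    ],
}
print(check_annotation(annotation))  # []
```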

Teleoperation Robot Data Format

Teleoperation robot data records the process of operators controlling robots through VR devices.

File Structure

f"{robot_name}_{date}_{timestamp}_{sequence_id}"
├── RM_AIDAL_250124_172033_0.mcap # Multimodal data
├── RM_AIDAL_250124_172033_0.json # Annotation data
└── RM_AIDAL_250126_093648_0.metadata.yaml # Metadata
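
Robot names such as RM_AIDAL contain underscores themselves, so the pattern above is easier to parse from the right. A minimal sketch, assuming the last three underscore-separated fields are always date, timestamp, and sequence ID (the function name is hypothetical):

```python
def parse_teleop_name(name: str) -> dict:
    """Split a teleoperation recording name into its four fields.

    rsplit with maxsplit=3 keeps underscores inside the robot name intact.
    """
    robot_name, date, timestamp, sequence_id = name.rsplit("_", 3)
    return {
        "robot_name": robot_name,
        "date": date,
        "timestamp": timestamp,
        "sequence_id": int(sequence_id),
    }


fields = parse_teleop_name("RM_AIDAL_250124_172033_0")
print(fields["robot_name"], fields["sequence_id"])  # RM_AIDAL 0
```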

Multimodal Data

Main Topic List:

Topic Name | Data Type | Description
/camera_01/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from main camera
/camera_02/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from left camera
/camera_03/color/image_raw/compressed | sensor_msgs/msg/CompressedImage | RGB image from right camera
io_teleop/joint_states | sensor_msgs/msg/JointState | Joint states
io_teleop/joint_cmd | sensor_msgs/msg/JointState | Joint commands
io_teleop/target_ee_poses | geometry_msgs/msg/PoseArray | Target end-effector poses
io_teleop/target_base_move | std_msgs/msg/Float64MultiArray | Target base movement
io_teleop/target_gripper_status | sensor_msgs/msg/JointState | Target gripper status
io_teleop/target_joint_from_vr | sensor_msgs/msg/JointState | Joint targets from VR device
/robot_description | std_msgs/msg/String | Robot URDF description
/tf | tf2_msgs/msg/TFMessage | TF spatial pose transformation information
Original MCAP data format:
Files: RM_AIDAL_250126_091041_0.mcap
Bag size: 443.3 MiB
Storage id: mcap
Duration: 100.052164792s
Start: Jan 24 2025 21:37:32.526605552 (1737725852.526605552)
End: Jan 24 2025 21:39:12.578770344 (1737725952.578770344)
Messages: 62116
Topic information: Topic: /camera_01/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_02/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: /camera_03/color/image_raw/compressed | Type: sensor_msgs/msg/CompressedImage | Count: 3000 | Serialization Format: cdr
Topic: io_teleop/joint_states | Type: sensor_msgs/msg/JointState | Count: 1529 | Serialization Format: cdr
Topic: io_teleop/joint_cmd | Type: sensor_msgs/msg/JointState | Count: 10009 | Serialization Format: cdr
Topic: io_teleop/target_ee_poses | Type: geometry_msgs/msg/PoseArray | Count: 10014 | Serialization Format: cdr
Topic: io_teleop/target_base_move | Type: std_msgs/msg/Float64MultiArray | Count: 10010 | Serialization Format: cdr
Topic: io_teleop/target_gripper_status | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: io_teleop/target_joint_from_vr | Type: sensor_msgs/msg/JointState | Count: 10012 | Serialization Format: cdr
Topic: /robot_description | Type: std_msgs/msg/String | Count: 1 | Serialization Format: cdr
Topic: /tf | Type: tf2_msgs/msg/TFMessage | Count: 1529 | Serialization Format: cdr
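
The message counts above imply different rates per topic over the roughly 100 s recording: about 30 Hz for the cameras, about 100 Hz for the io_teleop command and target topics, and about 15 Hz for io_teleop/joint_states and /tf. Pairing samples across such topics needs some form of timestamp matching. The sketch below shows nearest-timestamp matching as one common approach; it is illustrative only and not necessarily how the platform aligns data. The function name and the sample timestamps are made up.

```python
from bisect import bisect_left
from typing import List, Sequence


def nearest_indices(
    reference_ts: Sequence[int], other_ts: Sequence[int]
) -> List[int]:
    """For each reference timestamp, find the index of the closest other_ts.

    Both sequences must be sorted ascending and other_ts must be non-empty.
    Ties go to the earlier sample.
    """
    result = []
    for t in reference_ts:
        pos = bisect_left(other_ts, t)
        if pos == 0:
            result.append(0)
        elif pos == len(other_ts):
            result.append(len(other_ts) - 1)
        else:
            before, after = other_ts[pos - 1], other_ts[pos]
            result.append(pos - 1 if t - before <= after - t else pos)
    return result


camera_ts = [0, 33, 67, 100]           # milliseconds, about 30 Hz
command_ts = list(range(0, 101, 10))   # milliseconds, 100 Hz
matches = nearest_indices(camera_ts, command_ts)
print(matches)  # [0, 3, 7, 10]
```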

Natural Language Annotation

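The annotation format for teleoperation data follows that of human data: both describe in natural language which actions the robot or human performed and which objects were involved. In the example below, each subtask additionally carries objecta, objectb, and options fields and has no frame IDs.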

Complete teleoperation annotation data example:
{
  "belong_to": "RM_AIDAL_250126_091041_0",
  "mocap_offset": [],
  "object_set": [
    "lemon candy",
    "plate",
    "pistachios"
  ],
  "scene": "250126",
  "skill_set": [
    "place {A} on {B}"
  ],
  "subtasks": [
    {
      "skill": "place {A} on {B}",
      "objecta": "lemon candy",
      "objectb": "plate",
      "options": [
        "leftHand"
      ],
      "description": "place the lemon candy on the plate with the left hand",
      "end_timestamp": "1737725886915000000",
      "sequence_id": 1,
      "start_timestamp": "1737725880757000000",
      "comment": "",
      "attempts": "success"
    },
    {
      "skill": "place {A} on {B}",
      "objecta": "pistachios",
      "objectb": "plate",
      "options": [
        "rightHand"
      ],
      "description": "place the pistachios on the plate with the right hand",
      "end_timestamp": "1737725950745000000",
      "sequence_id": 2,
      "start_timestamp": "1737725941657000000",
      "comment": "",
      "attempts": "success"
    }
  ],
  "tag_set": [],
  "task_description": "20250205_RM_ItemPacking_zhouxw"
}

Export Model Training Data

To facilitate model training, the platform provides multiple data export capabilities, converting original MCAP and JSON data into formats suitable for machine learning training.

The common HDF5 and LeRobot formats can be exported with one click. The export adapts automatically to different robots and sensor counts, with no manual configuration.

HDF5 Format

HDF5 format is suitable for large-scale data storage and fast access, using a hierarchical structure to organize data.

File Structure:

chunk_001.hdf5
├── /data/ # Data group
│ ├── episode_001/ # First task sequence
│ │ ├── action # Joint commands (multi-dimensional array)
│ │ ├── observation.state # Sensor observations
│ │ ├── observation.gripper # Gripper status
│ │ └── observation.images.* # Multi-view images
│ └── episode_002/ # Second task sequence
└── /meta/ # Metadata group

Data Content:

  • action - Joint control commands (float32 array)
  • observation.state - Sensor observations (float32 array)
  • observation.images.* - Compressed image data (JPEG format)
  • observation.gripper - Gripper status (float32 array)
  • task - English natural language description
  • task_zh - Chinese natural language description
  • score - Action quality score
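
To make the hierarchy above concrete, the sketch below builds the dataset paths for one episode. It covers only the entries shown in the file structure; the camera names and the function name are hypothetical, and where task, task_zh, and score are stored is not specified here, so they are left out.

```python
from typing import List


def episode_dataset_paths(episode: int, cameras: List[str]) -> List[str]:
    """List the dataset paths of one episode in the layout shown above.

    Assumption: image datasets are named observation.images.<camera>.
    """
    group = f"/data/episode_{episode:03d}"
    names = ["action", "observation.state", "observation.gripper"]
    names += [f"observation.images.{cam}" for cam in cameras]
    return [f"{group}/{name}" for name in names]


paths = episode_dataset_paths(1, ["cam_high", "cam_left_wrist"])
for path in paths:
    print(path)
```

Paths of this form can be passed to an HDF5 reader such as h5py, for example by indexing an open h5py.File object with the path string.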

LeRobot Format

The LeRobot format is a widely used data format in robot learning and is compatible with mainstream robot learning frameworks.

Reference Sample Data: https://huggingface.co/datasets/io-ai-data/uncap_pen

Data Feature Definitions:

The feature lengths and shapes of exported LeRobot datasets adapt automatically to any number of cameras or joints. The shapes shown here are those exported for the Songling desktop 7-DOF robotic arm:

Feature Name | Data Type | Shape | Description
action | float32 | [14] | Joint commands (7 joints each for left and right arms)
observation.state | float32 | [14] | Joint states (7 joints each for left and right arms)
observation.images.cam_high | image | [3,480,640] | High camera image
observation.images.cam_low | image | [3,480,640] | Low camera image
observation.images.cam_left_wrist | image | [3,480,640] | Left wrist camera image
observation.images.cam_right_wrist | image | [3,480,640] | Right wrist camera image
timestamp | float32 | [1] | Timestamp
frame_index | int64 | [1] | Frame index
episode_index | int64 | [1] | Task sequence index
Complete LeRobot format definition example:
{
  "codebase_version": "v2.1",
  "robot_type": "aloha",
  "total_episodes": 10,
  "total_frames": 3000,
  "total_tasks": 1,
  "total_videos": 0,
  "total_chunks": 1,
  "chunks_size": 1000,
  "fps": 15,
  "splits": {
    "train": "0:10"
  },
  "data_path": "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet",
  "video_path": "videos/chunk-{episode_chunk:03d}/{video_key}/episode_{episode_index:06d}.mp4",
  "features": {
    "observation.state": {
      "dtype": "float32",
      "shape": [14],
      "names": [
        [
          "right_waist",
          "right_shoulder",
          "right_elbow",
          "right_forearm_roll",
          "right_wrist_angle",
          "right_wrist_rotate",
          "right_gripper",
          "left_waist",
          "left_shoulder",
          "left_elbow",
          "left_forearm_roll",
          "left_wrist_angle",
          "left_wrist_rotate",
          "left_gripper"
        ]
      ]
    },
    "action": {
      "dtype": "float32",
      "shape": [14],
      "names": [
        [
          "right_waist",
          "right_shoulder",
          "right_elbow",
          "right_forearm_roll",
          "right_wrist_angle",
          "right_wrist_rotate",
          "right_gripper",
          "left_waist",
          "left_shoulder",
          "left_elbow",
          "left_forearm_roll",
          "left_wrist_angle",
          "left_wrist_rotate",
          "left_gripper"
        ]
      ]
    },
    "observation.images.cam_high": {
      "dtype": "image",
      "shape": [3, 480, 640],
      "names": ["channels", "height", "width"]
    },
    "observation.images.cam_low": {
      "dtype": "image",
      "shape": [3, 480, 640],
      "names": ["channels", "height", "width"]
    },
    "observation.images.cam_left_wrist": {
      "dtype": "image",
      "shape": [3, 480, 640],
      "names": ["channels", "height", "width"]
    },
    "observation.images.cam_right_wrist": {
      "dtype": "image",
      "shape": [3, 480, 640],
      "names": ["channels", "height", "width"]
    },
    "timestamp": {
      "dtype": "float32",
      "shape": [1],
      "names": null
    },
    "frame_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "episode_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    },
    "task_index": {
      "dtype": "int64",
      "shape": [1],
      "names": null
    }
  }
}
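
The data_path template and the splits entry in the definition above can be resolved with plain Python string handling. This sketch assumes that the chunk index is the episode index integer-divided by chunks_size, which is the usual LeRobot v2.x convention but is not stated in this document; the function names are hypothetical.

```python
DATA_PATH = "data/chunk-{episode_chunk:03d}/episode_{episode_index:06d}.parquet"
CHUNKS_SIZE = 1000  # "chunks_size" from the definition above


def episode_file(episode_index: int, chunks_size: int = CHUNKS_SIZE) -> str:
    """Resolve the parquet path of one episode.

    Assumption: episode_chunk = episode_index // chunks_size.
    """
    episode_chunk = episode_index // chunks_size
    return DATA_PATH.format(
        episode_chunk=episode_chunk, episode_index=episode_index
    )


def parse_split(split: str) -> range:
    """Turn a "start:stop" split string into a range of episode indices."""
    start, stop = (int(part) for part in split.split(":"))
    return range(start, stop)


train_episodes = parse_split("0:10")
first_file = episode_file(train_episodes[0])
print(first_file)           # data/chunk-000/episode_000000.parquet
print(len(train_episodes))  # 10
```

With total_episodes set to 10 and chunks_size set to 1000, every episode of the sample dataset falls into chunk-000.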