M2P2: Multi-Modal Passive Perception Dataset
for Off-Road Mobility in Extreme Low-Light Conditions


[Paper] [Video] [Dataset] [Code]
About
Long-duration, off-road, autonomous missions require robots to continuously perceive their surroundings regardless of the ambient lighting conditions. Most existing autonomy systems heavily rely on active sensing, e.g., LiDAR, RADAR, and Time-of-Flight sensors, or use (stereo) visible light imaging sensors, e.g., color cameras, to perceive environment geometry and semantics. In scenarios where fully passive perception is required and lighting conditions are degraded to the extent that visible light cameras fail, most downstream mobility tasks such as obstacle avoidance become impossible. To address this challenge, we present a Multi-Modal Passive Perception dataset, M2P2, to enable off-road mobility in low-light to no-light conditions. We design a multi-modal sensor suite including thermal, event, and stereo RGB cameras, GPS, two Inertial Measurement Units (IMUs), as well as a high-resolution LiDAR for ground truth, with a novel multi-sensor calibration procedure that can efficiently transform multi-modal perceptual streams into a common coordinate system. Our 10-hour, 32 km dataset also includes mobility data such as robot odometry and actions and covers well-lit, low-light, and no-light conditions, along with paved, on-trail, and off-trail terrain. Our results demonstrate that off-road mobility and scene understanding in degraded visual environments are possible through passive perception alone, even in extreme low-light conditions.
Sensor Suite
Our sensor suite includes a Xenics Ceres T 1280 thermal camera, a Prophesee Metavision EVK4 event camera, two FLIR Blackfly S RGB cameras (stereo pair), a Yahboom 10-DoF IMU, a GPS receiver, and an Ouster OS1-128 3D LiDAR (with another embedded IMU) for ground truth.
We design a multi-modal calibration procedure based on a multi-material calibration target made of a 3 mm thick aluminum sheet and 35 mm carbon fiber squares. The sheet and the squares are cut to shape on a CNC milling machine with an accuracy of 0.05 mm. Since bare aluminum reflects most long-wave infrared (IR) radiation (similar to a mirror in the visible spectrum), we anodize the aluminum sheet to eliminate unwanted reflections in the IR spectrum. After the calibration target is heated to roughly 45°C, the large difference in emissivity between aluminum and carbon fiber makes the checkerboard pattern appear in the thermal image. The color contrast between aluminum and carbon fiber makes the same pattern visible in both RGB cameras.
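Kalibr handles target detection internally during calibration, so the snippet below is only a rough sanity check that the heated pattern is actually detectable in a thermal frame with OpenCV. The grid size, file name, and 16-bit normalization are illustrative assumptions, not details of our pipeline.

```python
import cv2
import numpy as np

# Hypothetical inner-corner count; the real target's grid size may differ.
BOARD_SIZE = (6, 8)

def find_corners(image_8u, board_size=BOARD_SIZE):
    """Detect checkerboard corners in an 8-bit grayscale image (thermal or RGB)."""
    found, corners = cv2.findChessboardCorners(
        image_8u, board_size,
        flags=cv2.CALIB_CB_ADAPTIVE_THRESH | cv2.CALIB_CB_NORMALIZE_IMAGE)
    if found:
        # Refine corner locations to sub-pixel accuracy.
        corners = cv2.cornerSubPix(
            image_8u, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
    return found, corners

# Thermal frames are often 16-bit; normalize to 8 bits before detection.
thermal_raw = cv2.imread("thermal_frame.png", cv2.IMREAD_UNCHANGED)
thermal_8u = cv2.normalize(thermal_raw, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
found, corners = find_corners(thermal_8u)
```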
To correlate asynchronous event data with other synchronous data streams, such as thermal and RGB images, we employ a two-step approach. First, we reconstruct a grayscale image from the raw event stream using E2Calib. In addition, we use the trigger input functionality of the event camera to precisely mark timestamps for frame reconstruction, enabling accurate temporal alignment between the reconstructed event frames and corresponding frames from other sensors. This method overcomes the inherently asynchronous nature of event data and establishes a reliable temporal relationship with the synchronous data streams, facilitating multi-modal sensor fusion and calibration.
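E2Calib performs learned image reconstruction from events; purely for illustration, the sketch below shows the simpler idea of binning events that fall between two consecutive trigger timestamps into a frame-aligned 2D histogram. The event field names ('x', 'y', 't', 'p') and the 1280×720 resolution are assumptions about the data layout.

```python
import numpy as np

def events_to_frame(events, t_start, t_end, height=720, width=1280):
    """Accumulate signed event polarities between two trigger timestamps into a
    single frame-aligned 2D histogram (a crude stand-in for learned
    reconstruction such as E2Calib)."""
    # events: structured array with fields 'x', 'y', 't', 'p' (assumed layout).
    sel = events[(events['t'] >= t_start) & (events['t'] < t_end)]
    frame = np.zeros((height, width), dtype=np.float32)
    polarity = np.where(sel['p'] > 0, 1.0, -1.0)
    np.add.at(frame, (sel['y'], sel['x']), polarity)
    return frame

# Usage: reconstruct one frame per trigger interval.
# frames = [events_to_frame(ev, t0, t1)
#           for t0, t1 in zip(trigger_times[:-1], trigger_times[1:])]
```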
We implement a synchronization scheme to align the multi-modal perception streams for calibration and data collection. All four cameras are synchronized to the LiDAR, which generates a 10 Hz sync pulse aligned to its encoder angle at 360°. This pulse triggers frame acquisition in the RGB and thermal cameras, and its edges mark temporal points in the event camera stream. The pulse width is set equal to the RGB camera's exposure time, and its falling edge is used for event camera frame reconstruction, aligning the reconstructed frame with the completion of the RGB camera's exposure and ensuring precise temporal correlation across all sensors.
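As a rough illustration of how a user of the dataset might verify this alignment, the sketch below pairs two triggered streams by nearest timestamp; the 5 ms tolerance is an arbitrary assumption, not a specification of the hardware synchronization.

```python
import numpy as np

def match_by_timestamp(ts_a, ts_b, tol=0.005):
    """Associate each timestamp in ts_a with its nearest neighbor in ts_b
    (both sorted, in seconds); drop pairs farther apart than `tol`.
    With hardware triggering, residual offsets should be well below `tol`."""
    ts_a, ts_b = np.asarray(ts_a), np.asarray(ts_b)
    idx = np.clip(np.searchsorted(ts_b, ts_a), 1, len(ts_b) - 1)
    nearest = np.where(ts_a - ts_b[idx - 1] < ts_b[idx] - ts_a, idx - 1, idx)
    keep = np.abs(ts_b[nearest] - ts_a) <= tol
    return np.flatnonzero(keep), nearest[keep]

# e.g., pair RGB and thermal frames captured on the same trigger pulse
# rgb_idx, thermal_idx = match_by_timestamp(rgb_stamps, thermal_stamps)
```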
All synchronized frames are assembled into a ROS bag, which is processed with the Kalibr calibration toolkit to estimate camera intrinsic and extrinsic parameters.
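The exact tooling used to splice the calibration bag is not described on this page; a hedged sketch (ROS 1, Python) of how time-aligned frames could be written into a bag for Kalibr is shown below. The topic names and encodings are placeholders.

```python
import rosbag
import rospy
from cv_bridge import CvBridge

bridge = CvBridge()

def write_calibration_bag(out_path, synced_frames):
    """Write time-aligned RGB, thermal, and reconstructed event frames into a
    single bag for Kalibr. Topic names below are placeholders."""
    with rosbag.Bag(out_path, 'w') as bag:
        for stamp, rgb, thermal, event_frame in synced_frames:
            t = rospy.Time.from_sec(stamp)
            for topic, img, enc in [('/rgb/image_raw', rgb, 'bgr8'),
                                    ('/thermal/image_raw', thermal, 'mono16'),
                                    ('/event/image_reconstructed', event_frame, 'mono8')]:
                msg = bridge.cv2_to_imgmsg(img, encoding=enc)
                msg.header.stamp = t
                bag.write(topic, msg, t)
```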
The following figure shows the LiDAR point cloud overlaid on the corresponding RGB image, along with the reconstructed event frame and thermal image, demonstrating the spatial and temporal alignment of the multi-modal data.
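For reference, an overlay like this can be produced from the calibrated intrinsics and extrinsics roughly as follows; the variable names are assumptions, and no lens distortion model is applied in this sketch.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K, image_shape):
    """Project 3D LiDAR points (N, 3) into the camera image using the calibrated
    extrinsic transform T_cam_lidar (4x4) and camera intrinsics K (3x3)."""
    pts_h = np.hstack([points_lidar, np.ones((len(points_lidar), 1))])
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0.1                    # keep points in front of the camera
    uvw = (K @ pts_cam[in_front].T).T
    uv = uvw[:, :2] / uvw[:, 2:3]
    h, w = image_shape[:2]
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv[valid], pts_cam[in_front][valid][:, 2]  # pixel coordinates and depths
```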
Dataset
Our M2P2 dataset encompasses over 10 hours of data collected across various challenging terrain conditions. The data are gathered with the sensor suite mounted on a Clearpath Husky A200 robot. The dataset includes sequences from a diverse range of environments, progressing from fully prepared paved trails to unpaved off-road paths, and ultimately to unprepared off-trail environments within densely forested areas featuring thick vegetation and narrow passages. To capture a comprehensive range of lighting conditions, data collection is conducted at dusk, with illuminance varying from 20 lx to complete darkness (0 lx). This ensures the dataset's applicability to both well-lit and no-light scenarios, addressing the challenges of navigation in varying environmental conditions.
The dataset is structured as ROS bag files, consisting of compressed RGB and thermal images at 10 FPS, the asynchronous raw event stream, 3D point clouds from the LiDAR, IMU data, GPS coordinates, robot odometry and status messages, and human-commanded joystick inputs. All camera data are synchronized using the trigger pulse from the LiDAR, ensuring temporal alignment across the multi-modal sensor inputs. Due to the dense tree canopy, GPS data is available for only 87.97% of the total dataset. To facilitate accurate sensor placement replication, we provide URDF (Unified Robot Description Format) files for the sensor suite configuration on the Husky platform, along with the calibrated transformations.
| Attribute | Quantity |
|---|---|
| Total Size | ≈2 TB |
| Total Distance | >32 km |
| Total Time | 10.15 h |
| Total GPS Lock Time | 8.93 h |
| Average Speed | 0.95 m/s |
| Number of RGB Images | 730,606 |
| Number of Thermal Images | 361,685 |
| Number of Events | 1.15×10¹¹ |
| Number of Point Clouds | 365,297 |
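A minimal sketch of iterating over one sequence with the ROS 1 rosbag API follows; the topic names listed are guesses for illustration, not the dataset's actual topic list.

```python
import rosbag

# Placeholder topic names; consult the dataset documentation for the actual list.
TOPICS = ['/rgb/left/image_raw/compressed', '/thermal/image_raw/compressed',
          '/ouster/points', '/imu/data', '/gps/fix', '/odom', '/joy']

with rosbag.Bag('m2p2_sequence.bag') as bag:
    for topic, msg, t in bag.read_messages(topics=TOPICS):
        # e.g., route each message to a per-sensor handler
        print(t.to_sec(), topic, type(msg).__name__)
```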
The following figure shows a LIO-SAM-generated map overlaid on a satellite image. The LiDAR point cloud aligns well with visible features (e.g., trail edges and vegetation), demonstrating mapping accuracy. The inset compares the estimated trajectory (blue) to raw GPS (green); the latter deviates significantly under dense tree cover, reflecting degraded signal quality, while the LIO-SAM trajectory remains consistently accurate.
Example Usages
We conduct two experiments using our M2P2 dataset to demonstrate its usefulness in off-road navigation under degraded lighting conditions.
End-to-End Navigation Learning
We deploy a Behavior Cloning model trained on our M2P2 dataset on the Husky robot for a 3.6 km autonomous navigation task on a paved hiking trail. The illuminance during the experiment ranges from 235 lx to 0 lx (indicated by the color of the path), with the robot completing the majority of the navigation in complete darkness (0 lx). The robot successfully completes the navigation, requiring only 11 human interventions when it goes off-course. Most interventions occur because the pavement and the gravel alongside it appear at similar temperatures in the thermal input and therefore confuse the robot. More sophisticated techniques that leverage other sensor modalities, e.g., the event camera, are necessary to enable more robust navigation.
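The behavior-cloning architecture is not detailed on this page; the sketch below only illustrates the general recipe of regressing recorded joystick commands from a thermal frame. The network shape, input normalization, and hyperparameters are all assumptions.

```python
import torch
import torch.nn as nn

class ThermalBCPolicy(nn.Module):
    """Minimal behavior-cloning policy: thermal frame -> (linear, angular) velocity.
    Architecture and input size are illustrative assumptions, not the paper's model."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, thermal):  # thermal: (B, 1, H, W), normalized
        return self.head(self.encoder(thermal))

# Training step: regress the human joystick commands recorded in the dataset.
policy = ThermalBCPolicy()
optim = torch.optim.Adam(policy.parameters(), lr=1e-4)

def bc_step(thermal_batch, cmd_batch):
    loss = nn.functional.mse_loss(policy(thermal_batch), cmd_batch)
    optim.zero_grad(); loss.backward(); optim.step()
    return loss.item()
```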
Perception in Degraded Visual Environments
To evaluate the efficacy of M2P2 in enabling scene perception in degraded visual environments, we conduct a comparative analysis of metric depth estimation. Specifically, we train a U-Net (31M parameters) to learn a mapping between thermal infrared imagery and corresponding depth information derived from the LiDAR point clouds. We compare the performance of this U-Net, trained on the M2P2 dataset, against DepthAnythingV2-Large, a monocular metric depth estimation model with approximately 335.3 million parameters. Qualitative inspection confirms that the U-Net trained on M2P2 generates depth maps of considerably higher fidelity compared to those produced by DepthAnythingV2-Large. This observation highlights the pivotal role of domain-specific datasets like M2P2 in enabling the development of robust perception models for degraded visual environments, where traditional RGB-based methods are inherently challenged. Our results suggest that such datasets are indispensable for bridging the gap between standard visual perception and the complexities introduced by atypical sensory inputs.
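One detail such a setup has to handle is that LiDAR supervision is sparse after projection into the thermal frame. A common approach, assumed here rather than taken from the paper, is to compute the loss only at pixels with valid LiDAR returns:

```python
import torch
import torch.nn.functional as F

def masked_depth_loss(pred_depth, lidar_depth):
    """L1 loss computed only where projected LiDAR depth is available.
    pred_depth, lidar_depth: (B, 1, H, W); zeros in lidar_depth mark missing returns."""
    mask = lidar_depth > 0
    if mask.sum() == 0:
        return pred_depth.new_tensor(0.0)
    return F.l1_loss(pred_depth[mask], lidar_depth[mask])
```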
Passive Visual Odometry with Thermal and Event Data
A unique characteristic of M2P2 is the inclusion of calibrated, synchronized thermal and event camera data, enabling exploration of passive perception in extremely low-light conditions. While prior work has investigated visual-inertial odometry using RGB and event cameras, the fusion of thermal and event data for odometry remains relatively underexplored. This combination holds significant promise for applications where visible light is scarce or unavailable, such as nighttime off-road navigation or covert operations. The closest existing work is RAMP-VO, and M2P2 helps advance this area of research. To demonstrate the potential of this multi-modal fusion, we adapt the RAMP-VO framework, originally designed for RGB and event data, to process thermal and event data from M2P2. We focus on a challenging 157.5 m segment of the Burke Lake trail to evaluate the robustness of the approach under varying light levels. Crucially, we simulate reduced lighting conditions by systematically subsampling the event stream. This allows us to assess the performance of the thermal-event odometry system as the available information from the event camera decreases. We experiment with retaining 80%, 50%, and 25% of the original events, representing progressively darker scenarios, in addition to using the full event data (100%). The following table presents the translational Absolute Trajectory Error (ATE) for each event subsampling level. As expected, the error generally increases as the event data becomes sparser.
| Event Percentage | Translational ATE (m) |
|---|---|
| 100% (Full Event Data) | 8.79 |
| 80% | 11.60 |
| 50% | 12.79 |
| 25% | 12.49 |
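The exact subsampling scheme is not specified on this page; a simple way to retain a fixed fraction of the event stream, assuming uniform random retention, is sketched below.

```python
import numpy as np

def subsample_events(events, keep_fraction, seed=0):
    """Randomly retain a fraction of events to simulate darker scenes
    (uniform sampling is assumed here; the paper's scheme may differ)."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(events)) < keep_fraction
    return events[keep]

# e.g., 80%, 50%, and 25% of the original stream
# for frac in (0.8, 0.5, 0.25):
#     sparse = subsample_events(events, frac)
```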
Gallery
Links
Contact
For questions, please contact:
Dr. Xuesu Xiao
Department of Computer Science
George Mason University
4400 University Drive MSN 4A5, Fairfax, VA 22030 USA
xiao@gmu.edu