Robotics 39
☆ Beyond One Shot, Beyond One Perspective: Cross-View and Long-Horizon Distillation for Better LiDAR Representations ICCV 2025
LiDAR representation learning aims to extract rich structural and semantic
information from large-scale, readily available datasets, reducing reliance on
costly human annotations. However, existing LiDAR representation strategies
often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting
their effectiveness. In this work, we propose LiMA, a novel long-term
image-to-LiDAR Memory Aggregation framework that explicitly captures longer
range temporal correlations to enhance LiDAR representation learning. LiMA
comprises three key components: 1) a Cross-View Aggregation module that aligns
and fuses overlapping regions across neighboring camera views, constructing a
more unified and redundancy-free memory bank; 2) a Long-Term Feature
Propagation mechanism that efficiently aligns and integrates multi-frame image
features, reinforcing temporal coherence during LiDAR representation learning;
and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency
across driving sequences, improving generalization to unseen environments. LiMA
maintains high pretraining efficiency and incurs no additional computational
overhead during downstream tasks. Extensive experiments on mainstream
LiDAR-based perception benchmarks demonstrate that LiMA significantly improves
both LiDAR semantic segmentation and 3D object detection. We hope this work
inspires more effective pretraining paradigms for autonomous driving. The code
has been made publicly accessible for future research.
comment: ICCV 2025; 26 pages, 12 figures, 10 tables; Code at
http://github.com/Xiangxu-0103/LiMA
☆ From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving SC 2025
Fabian Konstantinidis, Ariel Dallari Guerreiro, Raphael Trumpp, Moritz Sackmann, Ulrich Hofmann, Marco Caccamo, Christoph Stiller
Accurate motion prediction of surrounding traffic participants is crucial for
the safe and efficient operation of automated vehicles in dynamic environments.
Marginal prediction models commonly forecast each agent's future trajectories
independently, often leading to sub-optimal planning decisions for an automated
vehicle. In contrast, joint prediction models explicitly account for the
interactions between agents, yielding socially and physically consistent
predictions on a scene level. However, existing approaches differ not only in
their problem formulation but also in the model architectures and
implementation details used, making it difficult to compare them. In this work,
we systematically investigate different approaches to joint motion prediction,
including post-processing of the marginal predictions, explicitly training the
model for joint predictions, and framing the problem as a generative task. We
evaluate each approach in terms of prediction accuracy, multi-modality, and
inference efficiency, offering a comprehensive analysis of the strengths and
limitations of each approach. Several prediction examples are available at
https://frommarginaltojointpred.github.io/.
comment: Accepted at International Conference on Intelligent Transportation
Systems 2025 (ITSC 2025)
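Illustrative sketch (not from the paper): the difference between marginal and joint prediction can be made concrete through the evaluation metric. The code below assumes K predicted modes per agent and the commonly used minimum average displacement error. The marginal variant picks the best mode for each agent independently, while the joint variant must pick one mode index for the whole scene. The function names and the toy scene are hypothetical.

```python
import math


def ade(pred, gt):
    """Average displacement error between two trajectories of (x, y) points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def marginal_min_ade(preds, gts):
    """preds[agent][mode] is a trajectory; the best mode is chosen per agent."""
    per_agent = [min(ade(mode, gt) for mode in modes) for modes, gt in zip(preds, gts)]
    return sum(per_agent) / len(gts)


def joint_min_ade(preds, gts):
    """A single mode index is chosen for all agents in the scene."""
    num_modes = len(preds[0])
    return min(
        sum(ade(modes[k], gt) for modes, gt in zip(preds, gts)) / len(gts)
        for k in range(num_modes)
    )


# Toy scene: two agents, two modes each, two time steps.
gts = [[(0, 0), (1, 0)], [(0, 1), (1, 1)]]
preds = [
    [[(0, 0), (1, 0)], [(0, 0), (1, 1)]],  # agent 0: mode 0 is exact
    [[(0, 1), (1, 0)], [(0, 1), (1, 1)]],  # agent 1: mode 1 is exact
]
print(marginal_min_ade(preds, gts), joint_min_ade(preds, gts))
```

In this toy scene each agent has an exact mode, but the exact modes sit at different indices, so the marginal score looks perfect while no single scene-level mode is consistent for both agents.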
☆ Action Space Reduction Strategies for Reinforcement Learning in Autonomous Driving
Reinforcement Learning (RL) offers a promising framework for autonomous
driving by enabling agents to learn control policies through interaction with
environments. However, large and high-dimensional action spaces often used to
support fine-grained control can impede training efficiency and increase
exploration costs. In this study, we introduce and evaluate two novel
structured action space modification strategies for RL in autonomous driving:
dynamic masking and relative action space reduction. These approaches are
systematically compared against fixed reduction schemes and full action space
baselines to assess their impact on policy learning and performance. Our
framework leverages a multimodal Proximal Policy Optimization agent that
processes both semantic image sequences and scalar vehicle states. The proposed
dynamic and relative strategies incorporate real-time action masking based on
context and state transitions, preserving action consistency while eliminating
invalid or suboptimal choices. Through comprehensive experiments across diverse
driving routes, we show that action space reduction significantly improves
training stability and policy performance. The dynamic and relative schemes, in
particular, achieve a favorable balance between learning speed, control
precision, and generalization. These findings highlight the importance of
context-aware action space design for scalable and reliable RL in autonomous
driving tasks.
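Illustrative sketch (not the authors' implementation): one common way to realize action masking for a discrete policy is to zero out the probability of disallowed actions before sampling. The code below assumes a discrete action index, a hypothetical "relative" rule that only allows actions within a fixed index distance of the previous action, and a masked softmax over the policy logits. All names and numbers are assumptions for illustration.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over allowed actions only; masked actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be allowed")
    peak = max(l for l, m in zip(logits, mask) if m)  # for numerical stability
    exps = [math.exp(l - peak) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def relative_mask(num_actions, previous_action, max_step):
    """Allow only actions whose index is within max_step of the previous one."""
    return [abs(a - previous_action) <= max_step for a in range(num_actions)]


logits = [2.0, 0.5, 0.1, 1.5, 3.0]
mask = relative_mask(len(logits), previous_action=1, max_step=1)
probs = masked_softmax(logits, mask)
print(mask)
print(probs)
```

Masking at the probability level keeps the network output size fixed, so the same policy head can be trained with or without a mask.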
☆ StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
Vision-and-Language Navigation (VLN) in real-world settings requires agents
to process continuous visual streams and generate actions with low latency
grounded in language instructions. While Video-based Large Language Models
(Video-LLMs) have driven recent progress, current VLN methods based on
Video-LLM often face trade-offs among fine-grained visual understanding,
long-term context modeling and computational efficiency. We introduce
StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context
modeling strategy to support multi-modal reasoning over interleaved vision,
language and action inputs. The fast-streaming dialogue context facilitates
responsive action generation through a sliding-window of active dialogues,
while the slow-updating memory context compresses historical visual states
using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN
achieves coherent multi-turn dialogue through efficient KV cache reuse,
supporting long video streams with bounded context size and inference cost.
Experiments on VLN-CE benchmarks demonstrate state-of-the-art performance with
stable low latency, ensuring robustness and efficiency in real-world
deployment. The project page is:
https://streamvln.github.io/.
☆ NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving
Autonomous driving systems have made significant advances in Q&A, perception,
prediction, and planning based on local visual information, yet they struggle
to incorporate broader navigational context that human drivers routinely
utilize. We address this critical gap between local sensor data and global
navigation information by proposing NavigScene, an auxiliary navigation-guided
natural language dataset that simulates a human-like driving environment within
autonomous driving systems. Moreover, we develop three complementary paradigms
to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances
vision-language models by incorporating navigation context into the prompting
approach; (2) Navigation-guided Preference Optimization, a reinforcement
learning method that extends Direct Preference Optimization to improve
vision-language model responses by establishing preferences for
navigation-relevant summarized information; and (3) Navigation-guided
Vision-Language-Action model, which integrates navigation guidance and
vision-language models with conventional driving models through feature fusion.
Extensive experiments demonstrate that our approaches significantly improve
performance across perception, prediction, planning, and question-answering
tasks by enabling reasoning capabilities beyond visual range and improving
generalization to diverse driving scenarios. This work represents a significant
step toward more comprehensive autonomous driving systems capable of navigating
complex, unfamiliar environments with greater reliability and safety.
comment: Accepted by ACM Multimedia 2025
☆ EmbodieDreamer: Advancing Real2Sim2Real Transfer for Policy Training via Embodied World Modeling
Boyuan Wang, Xinpan Meng, Xiaofeng Wang, Zheng Zhu, Angen Ye, Yang Wang, Zhiqin Yang, Chaojun Ni, Guan Huang, Xingang Wang
The rapid advancement of Embodied AI has led to an increasing demand for
large-scale, high-quality real-world data. However, collecting such embodied
data remains costly and inefficient. As a result, simulation environments have
become a crucial surrogate for training robot policies. Yet, the significant
Real2Sim2Real gap remains a critical bottleneck, particularly in terms of
physical dynamics and visual appearance. To address this challenge, we propose
EmbodieDreamer, a novel framework that reduces the Real2Sim2Real gap from both
the physics and appearance perspectives. Specifically, we propose PhysAligner,
a differentiable physics module designed to reduce the Real2Sim physical gap.
It jointly optimizes robot-specific parameters such as control gains and
friction coefficients to better align simulated dynamics with real-world
observations. In addition, we introduce VisAligner, which incorporates a
conditional video diffusion model to bridge the Sim2Real appearance gap by
translating low-fidelity simulated renderings into photorealistic videos
conditioned on simulation states, enabling high-fidelity visual transfer.
Extensive experiments validate the effectiveness of EmbodieDreamer. The
proposed PhysAligner reduces physical parameter estimation error by 3.74%
compared to simulated annealing methods while improving optimization speed by
89.91%. Moreover, training robot policies in the generated photorealistic
environment leads to a 29.17% improvement in the average task success rate
across real-world tasks after reinforcement learning. Code, model and data will
be publicly available.
comment: Project Page: https://embodiedreamer.github.io/
☆ Critiques of World Models
World Model, the supposed algorithmic surrogate of the real-world environment
that biological agents experience and act upon, has been an emerging
topic in recent years because of the rising need to develop virtual agents
with artificial (general) intelligence. There has been much debate on what a
world model really is, how to build it, how to use it, and how to evaluate it.
In this essay, starting from the imagination in the famed Sci-Fi classic Dune,
and drawing inspiration from the concept of "hypothetical thinking" in
psychology literature, we offer critiques of several schools of thought on
world modeling, and argue the primary goal of a world model to be simulating
all actionable possibilities of the real world for purposeful reasoning and
acting. Building on the critiques, we propose a new architecture for a
general-purpose world model, based on hierarchical, multi-level, and mixed
continuous/discrete representations, and a generative and self-supervision
learning framework, with an outlook of a Physical, Agentic, and Nested (PAN)
AGI system enabled by such a model.
☆ LERa: Replanning with Visual Feedback in Instruction Following IROS 2025
Svyatoslav Pchelintsev, Maxim Patratskiy, Anatoly Onishchenko, Alexandr Korchemnyi, Aleksandr Medvedev, Uliana Vinogradova, Ilya Galuzinsky, Aleksey Postnikov, Alexey K. Kovalev, Aleksandr I. Panov
Large Language Models are increasingly used in robotics for task planning,
but their reliance on textual inputs limits their adaptability to real-world
changes and failures. To address these challenges, we propose LERa - Look,
Explain, Replan - a Visual Language Model-based replanning approach that
utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB
image, a natural language instruction, an initial task plan, and failure
detection - without additional information such as object detection or
predefined conditions that may be unavailable in a given scenario. The
replanning process consists of three steps: (i) Look, where LERa generates a
scene description and identifies errors; (ii) Explain, where it provides
corrective guidance; and (iii) Replan, where it modifies the plan accordingly.
LERa is adaptable to various agent architectures and can handle errors from
both dynamic scene changes and task execution failures. We evaluate LERa on the
newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40%
improvement over baselines in dynamic environments. In tabletop manipulation
tasks with a predefined probability of task failure within the PyBullet
simulator, LERa improves success rates by up to 67%. Further experiments,
including real-world trials with a tabletop manipulator robot, confirm LERa's
effectiveness in replanning. We demonstrate that LERa is a robust and adaptable
solution for error-aware task execution in robotics. The code is available at
https://lera-robo.github.io.
comment: IROS 2025
☆ VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots IROS 2025
In the field of robotics, researchers face a critical challenge in ensuring
reliable and efficient task planning. Verifying high-level task plans before
execution significantly reduces errors and enhances the overall performance of
these systems. In this paper, we propose an architecture for automatically
verifying high-level task plans before their execution in simulator or
real-world environments. Leveraging Large Language Models (LLMs), our approach
consists of two key steps: first, the conversion of natural language
instructions into Linear Temporal Logic (LTL), followed by a comprehensive
analysis of action sequences. The module uses the reasoning capabilities of the
LLM to evaluate logical coherence and identify potential gaps in the plan.
Rigorous testing on datasets of varying complexity demonstrates the broad
applicability of the module to household tasks. We contribute to improving the
reliability and efficiency of task planning and address the critical need for
robust pre-execution verification in autonomous systems. The code is available
at https://verifyllm.github.io.
comment: IROS 2025
☆ VOTE: Vision-Language-Action Optimization with Trajectory Ensemble Voting
Juyi Lin, Amir Taherin, Arash Akbari, Arman Akbari, Lei Lu, Guangyu Chen, Taskin Padir, Xiaomeng Yang, Weiwei Chen, Yiqian Li, Xue Lin, David Kaeli, Pu Zhao, Yanzhi Wang
Recent large-scale Vision Language Action (VLA) models have shown superior
performance in robotic manipulation tasks guided by natural language. However,
their generalization remains limited when applied to novel objects or
unfamiliar environments that lie outside the training distribution. To address
this, many existing approaches integrate additional components such as depth
estimation, segmentation, or even diffusion to improve generalization, at the
cost of adding significant computation overhead, resulting in low efficiency.
This motivates the exploration of efficient action prediction methods, which
are independent of additional high-level visual representations or diffusion
techniques. In this work, we propose VOTE, an efficient and general framework
for the optimization and acceleration of VLA models. In detail, we propose a
novel tokenizer-free fine-tuning approach for parallel accurate action
prediction, which reduces computational overhead and accelerates inference
speed. Additionally, we adopt an ensemble voting strategy for the action
sampling, which significantly improves model performance and enhances
generalization. Experimental results show that our method achieves
state-of-the-art performance with 35× faster inference and 145 Hz
throughput. All the details and code will be open-sourced.
☆ Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance
Accurate trajectory prediction is critical for safe autonomous navigation,
yet the impact of dataset design on model performance remains understudied.
This work systematically examines how feature selection, cross-dataset
transfer, and geographic diversity influence trajectory prediction accuracy in
multi-agent settings. We evaluate a state-of-the-art model using our novel L4
Motion Forecasting dataset based on our own data recordings in Germany and the
US. This includes enhanced map and agent features. We compare our dataset to
the US-centric Argoverse 2 benchmark. First, we find that incorporating
supplementary map and agent features unique to our dataset yields no
measurable improvement over baseline features, demonstrating that modern
architectures do not need extensive feature sets for optimal performance. The
limited features of public datasets are sufficient to capture convoluted
interactions without added complexity. Second, we perform cross-dataset
experiments to evaluate how effectively domain knowledge can be transferred
between datasets. Third, we group our dataset by country and check the
knowledge transfer between different driving cultures.
☆ From Autonomy to Agency: Agentic Vehicles for Human-Centered Mobility Systems
Autonomy, from the Greek autos (self) and nomos (law), refers to the capacity
to operate according to internal rules without external control. Accordingly,
autonomous vehicles (AuVs) are defined as systems capable of perceiving their
environment and executing preprogrammed tasks independently of external input.
However, both research and real-world deployments increasingly showcase
vehicles that demonstrate behaviors beyond this definition (including the SAE
levels 1 to 6), such as interaction with humans and machines, goal adaptation,
contextual reasoning, external tool use, and long-term planning, particularly
with the integration of large language models (LLMs) and agentic AI systems.
These developments reveal a conceptual gap between technical autonomy and the
broader cognitive and social capabilities needed for future human-centered
mobility systems. To address this, we introduce the concept of agentic vehicles
(AgVs), referring to vehicles that integrate agentic AI to reason, adapt, and
interact within complex environments. This paper presents a systems-level
framework to characterize AgVs, focusing on their cognitive and communicative
layers and differentiating them from conventional AuVs. It synthesizes relevant
advances in agentic AI, robotics, multi-agent systems, and human-machine
interaction, and highlights how agentic AI, through high-level reasoning and
tool use, can function not merely as computational tools but as interactive
agents embedded in mobility ecosystems. The paper concludes by identifying key
challenges in the development and governance of AgVs, including safety,
real-time control, public acceptance, ethical alignment, and regulatory
frameworks.
☆ Unifying Robot Optimization: Monte Carlo Tree Search with Tensor Factorization
Many robotic tasks, such as inverse kinematics, motion planning, and optimal
control, can be formulated as optimization problems. Solving these problems
involves addressing nonlinear kinematics, complex contact dynamics, and
long-horizon planning, each posing distinct challenges for state-of-the-art
optimization methods. To efficiently solve a wide range of tasks across varying
scenarios, researchers either develop specialized algorithms for the task at
hand, or switch between different frameworks. Monte Carlo Tree Search (MCTS)
is a general-purpose decision-making tool that enables strategic exploration
across problem instances without relying on task-specific structures. However,
MCTS suffers from combinatorial complexity, leading to slow convergence and
high memory usage. To address this limitation, we propose Tensor Train
Tree Search (TTTS), which leverages tensor factorization to exploit the
separable structure of decision trees. This yields a low-rank,
linear-complexity representation that significantly reduces both computation
time and storage requirements. We prove that TTTS can efficiently reach the
bounded global optimum within a finite time. Experimental results across
inverse kinematics, motion planning around obstacles, multi-stage motion
planning, and bimanual whole-body manipulation demonstrate the efficiency of
TTTS on a diverse set of robotic tasks.
comment: 46 pages, 8 figures
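Illustrative sketch (not the TTTS algorithm itself): the abstract's claim of a low-rank, linear-complexity representation can be illustrated with a plain tensor-train (TT) format, in which one entry of a d-dimensional table is evaluated as a chain of small matrix products. The example assumes a fully separable cost, f(i_1, ..., i_d) = sum_k costs[k][i_k], which admits rank-2 TT cores, and at least two decision stages. The helper names are hypothetical.

```python
def tt_entry(cores, index):
    """Evaluate one table entry as a chain of small matrix products."""
    row = cores[0][index[0]][0]  # slices of the first core are 1 x r matrices
    for core, i in zip(cores[1:], index[1:]):
        matrix = core[i]
        row = [
            sum(row[k] * matrix[k][j] for k in range(len(row)))
            for j in range(len(matrix[0]))
        ]
    return row[0]


def additive_cores(costs):
    """Rank-2 TT cores for f(i_1..i_d) = sum_k costs[k][i_k], with d >= 2."""
    last = len(costs) - 1
    cores = []
    for k, values in enumerate(costs):
        if k == 0:
            cores.append([[[v, 1.0]] for v in values])  # 1 x 2 slices
        elif k == last:
            cores.append([[[1.0], [v]] for v in values])  # 2 x 1 slices
        else:
            cores.append([[[1.0, 0.0], [v, 1.0]] for v in values])  # 2 x 2 slices
    return cores


# Three decision stages with 2, 3 and 2 options respectively.
costs = [[1.0, 2.0], [10.0, 20.0, 30.0], [100.0, 200.0]]
cores = additive_cores(costs)
print(tt_entry(cores, (1, 2, 0)))
```

The number of stored values grows with the sum of the stage sizes times the squared rank, whereas the full table grows with their product.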
☆ Automated UAV-based Wind Turbine Blade Inspection: Blade Stop Angle Estimation and Blade Detail Prioritized Exposure Adjustment IROS 2025
Unmanned aerial vehicles (UAVs) are critical in the automated inspection of
wind turbine blades. Nevertheless, several issues persist in this domain.
Firstly, existing inspection platforms encounter challenges in meeting the
demands of automated inspection tasks and scenarios. Moreover, current blade
stop angle estimation methods are vulnerable to environmental factors,
restricting their robustness. Additionally, there is an absence of real-time
blade detail prioritized exposure adjustment during capture, where lost details
cannot be restored through post-optimization. To address these challenges, we
introduce a platform and two approaches. Initially, a UAV inspection platform
is presented to meet the automated inspection requirements. Subsequently, a
Fermat point based blade stop angle estimation approach is introduced,
achieving higher precision and success rates. Finally, we propose a blade
detail prioritized exposure adjustment approach to ensure appropriate
brightness and preserve details during image capture. Extensive tests,
comprising over 120 flights across 10 wind turbine models in 5 operational wind
farms, validate the effectiveness of the proposed approaches in enhancing
inspection autonomy.
comment: 8 pages, 7 figures, accepted by IROS 2025
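Illustrative sketch (not from the paper): the Fermat point of a triangle is the point that minimizes the summed distance to the three vertices. The abstract does not say how it enters the stop angle estimate, so the code below only shows the geometric primitive, approximated with the Weiszfeld iteration for the geometric median. It assumes 2D image-plane points. The function name and the example triangle are hypothetical.

```python
import math


def fermat_point(a, b, c, iterations=200, eps=1e-12):
    """Approximate the point minimizing the summed distance to a, b and c."""
    pts = [a, b, c]
    x = sum(p[0] for p in pts) / 3.0  # start from the centroid
    y = sum(p[1] for p in pts) / 3.0
    for _ in range(iterations):
        wsum = xs = ys = 0.0
        for px, py in pts:
            d = math.hypot(x - px, y - py)
            if d < eps:
                return (px, py)  # the iterate landed on a vertex
            w = 1.0 / d
            wsum += w
            xs += w * px
            ys += w * py
        x, y = xs / wsum, ys / wsum
    return (x, y)


# Isosceles triangle with all angles below 120 degrees.
print(fermat_point((0.0, 0.0), (4.0, 0.0), (2.0, 3.0)))
```

When every angle of the triangle is below 120 degrees, the three segments from the Fermat point to the vertices meet at 120 degrees, which matches the nominal spacing of a three-bladed rotor seen head-on.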
☆ Piggyback Camera: Easy-to-Deploy Visual Surveillance by Mobile Sensing on Commercial Robot Vacuums
This paper presents Piggyback Camera, an easy-to-deploy system for visual
surveillance using commercial robot vacuums. Rather than requiring access to
internal robot systems, our approach mounts a smartphone equipped with a camera
and Inertial Measurement Unit (IMU) on the robot, making it applicable to any
commercial robot without hardware modifications. The system estimates robot
poses through neural inertial navigation and efficiently captures images at
regular spatial intervals throughout the cleaning task. We develop a novel
test-time data augmentation method called Rotation-Augmented Ensemble (RAE) to
mitigate domain gaps in neural inertial navigation. A loop closure method that
exploits robot cleaning patterns further refines these estimated poses. We
demonstrate the system with an object mapping application that analyzes
captured images to geo-localize objects in the environment. Experimental
evaluation in retail environments shows that our approach achieves 0.83 m
relative pose error for robot localization and 0.97 m positional error for
object mapping of over 100 items.
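Illustrative sketch (not the authors' RAE implementation): a generic rotation-based test-time ensemble rotates the input by several yaw angles, runs the model on each copy, rotates each prediction back, and averages. The code below assumes planar 2D samples and a hypothetical model that maps a sample sequence to a 2D displacement. The toy model adds a constant, frame-dependent bias to show what the averaging can cancel.

```python
import math


def rotate(vec, angle):
    """Rotate a 2D vector counter-clockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * vec[0] - s * vec[1], s * vec[0] + c * vec[1])


def rotation_augmented_predict(model, samples, num_rotations=8):
    """Average model outputs over evenly spaced input rotations."""
    total_x = total_y = 0.0
    for i in range(num_rotations):
        angle = 2.0 * math.pi * i / num_rotations
        rotated = [rotate(s, angle) for s in samples]
        dx, dy = rotate(model(rotated), -angle)  # undo the input rotation
        total_x += dx
        total_y += dy
    return (total_x / num_rotations, total_y / num_rotations)


def biased_model(samples):
    """Toy stand-in: sums the inputs and adds a fixed bias along x."""
    return (sum(s[0] for s in samples) + 0.3, sum(s[1] for s in samples))


samples = [(1.0, 0.0), (0.5, 0.5)]
print(biased_model(samples))
print(rotation_augmented_predict(biased_model, samples))
```

In this toy case the bias is fixed in the model's frame, so rotating the predictions back spreads it evenly around the circle and the average removes it.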
☆ Dynamics and multi-stability of a rotor-actuated Twistcar robot with passive steering joint
The nonlinear dynamics of many under-actuated wheeled platforms are governed
by nonholonomic constraints of no-skid for passively rolling wheels, coupled
with momentum balance. In most theoretical models, the shape variables, i.e.
joint angles, are directly prescribed as periodic inputs, such as the steering
angle of the Twistcar. In this work, we study a variant of the Twistcar model
where the actuation input is periodic oscillations of an inertial rotor
attached to the main body, while the steering joint is passively free to
rotate. Remarkably, the dynamics of this model are extremely rich and include
a multiplicity of periodic solutions, both symmetric and asymmetric, as well as
stability transitions and bifurcations. We conduct numerical simulations as
well as asymptotic analysis of the vehicle's reduced equations of motion. We
use a perturbation expansion to obtain the leading-order dynamics under the
symmetric periodic solution. Then, we utilize harmonic balance and further
scaling assumptions in order to approximate the conditions for
symmetry-breaking pitchfork bifurcation and stability transition of the
symmetric periodic solution, as a function of actuation frequency and
structural parameters. The asymptotic results show good agreement with
numerical simulations. The results highlight the role of passive shape
variables in generating multi-stable periodic solutions for nonholonomic
systems of robotic locomotion.
comment: Supporting Information is available at
https://yizhar.net.technion.ac.il/files/2025/06/SI-MATLAB-file-Anna-Z.zip
☆ Safe Bimanual Teleoperation with Language-Guided Collision Avoidance
Teleoperating precise bimanual manipulations in cluttered environments is
challenging for operators, who often struggle with limited spatial perception
and difficulty estimating distances between target objects, the robot's body,
obstacles, and the surrounding environment. To address these challenges, local
robot perception and control should assist the operator during teleoperation.
In this work, we introduce a safe teleoperation system that enhances operator
control by preventing collisions in cluttered environments through the
combination of immersive VR control and voice-activated collision avoidance.
Using HTC Vive controllers, operators directly control a bimanual mobile
manipulator, while spoken commands such as "avoid the yellow tool" trigger
visual grounding and segmentation to build 3D obstacle meshes. These meshes are
integrated into a whole-body controller to actively prevent collisions during
teleoperation. Experiments in static, cluttered scenes demonstrate that our
system significantly improves operational safety without compromising task
efficiency.
☆ Interaction-Merged Motion Planning: Effectively Leveraging Diverse Motion Datasets for Robust Planning ICCV 2025
Motion planning is a crucial component of autonomous robot driving. While
various trajectory datasets exist, effectively utilizing them for a target
domain remains challenging due to differences in agent interactions and
environmental characteristics. Conventional approaches, such as domain
adaptation or ensemble learning, leverage multiple source datasets but suffer
from domain imbalance, catastrophic forgetting, and high computational costs.
To address these challenges, we propose Interaction-Merged Motion Planning
(IMMP), a novel approach that leverages parameter checkpoints trained on
different domains during adaptation to the target domain. IMMP follows a
two-step process: pre-merging to capture agent behaviors and interactions,
sufficiently extracting diverse information from the source domain, followed by
merging to construct an adaptable model that efficiently transfers diverse
interactions to the target domain. Our method is evaluated on various planning
benchmarks and models, demonstrating superior performance compared to
conventional approaches.
comment: Accepted at ICCV 2025
☆ Training-free Generation of Temporally Consistent Rewards from VLMs
Yinuo Zhao, Jiale Yuan, Zhiyuan Xu, Xiaoshuai Hao, Xinyi Zhang, Kun Wu, Zhengping Che, Chi Harold Liu, Jian Tang
Recent advances in vision-language models (VLMs) have significantly improved
performance in embodied tasks such as goal decomposition and visual
comprehension. However, providing accurate rewards for robotic manipulation
without fine-tuning VLMs remains challenging due to the absence of
domain-specific robotic knowledge in pre-trained datasets and high
computational costs that hinder real-time applicability. To address this, we
propose T2-VLM, a novel training-free, temporally consistent
framework that generates accurate rewards through tracking the status changes
in VLM-derived subgoals. Specifically, our method first queries the VLM to
establish spatially aware subgoals and an initial completion estimate before
each round of interaction. We then employ a Bayesian tracking algorithm to
update the goal completion status dynamically, using subgoal hidden states to
generate structured rewards for reinforcement learning (RL) agents. This
approach enhances long-horizon decision-making and improves failure recovery
capabilities with RL. Extensive experiments indicate that T2-VLM
achieves state-of-the-art performance in two robot manipulation benchmarks,
demonstrating superior reward accuracy with reduced computation consumption. We
believe our approach not only advances reward generation techniques but also
contributes to the broader field of embodied AI. Project website:
https://t2-vlm.github.io/.
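Illustrative sketch (not the T2-VLM algorithm): the idea of tracking subgoal completion with a Bayesian update can be shown with a binary hidden state per subgoal and a noisy "done / not done" observation. The code below assumes hypothetical detection rates for the observation source and defines the step reward as the total change in completion belief. All names and numbers are assumptions.

```python
def bayes_update(belief, observed_done, true_positive=0.9, false_positive=0.2):
    """Posterior P(subgoal done) after one noisy binary observation."""
    if observed_done:
        like_done, like_not = true_positive, false_positive
    else:
        like_done, like_not = 1.0 - true_positive, 1.0 - false_positive
    evidence = like_done * belief + like_not * (1.0 - belief)
    return like_done * belief / evidence


def step_reward(beliefs, observations):
    """Update every subgoal belief and reward the total belief change."""
    updated = [bayes_update(b, o) for b, o in zip(beliefs, observations)]
    reward = sum(u - b for u, b in zip(updated, beliefs))
    return updated, reward


beliefs = [0.5, 0.5]
beliefs, reward = step_reward(beliefs, [True, False])
print(beliefs, reward)
```

Because the belief is carried across steps, a single contradictory observation shifts the estimate without resetting it, which is one way to keep the reward signal temporally consistent.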
☆ CueLearner: Bootstrapping and local policy adaptation from relative feedback IROS 2025
Human guidance has emerged as a powerful tool for enhancing reinforcement
learning (RL). However, conventional forms of guidance such as demonstrations
or binary scalar feedback can be challenging to collect or have low information
content, motivating the exploration of other forms of human input. Among these,
relative feedback (i.e., feedback on how to improve an action, such as "more to
the left") offers a good balance between usability and information richness.
Previous research has shown that relative feedback can be used to enhance
policy search methods. However, these efforts have been limited to specific
policy classes and use feedback inefficiently. In this work, we introduce a
novel method to learn from relative feedback and combine it with off-policy
reinforcement learning. Through evaluations on two sparse-reward tasks, we
demonstrate our method can be used to improve the sample efficiency of
reinforcement learning by guiding its exploration process. Additionally, we
show it can adapt a policy to changes in the environment or the user's
preferences. Finally, we demonstrate real-world applicability by employing our
approach to learn a navigation policy in a sparse reward setting.
comment: Accepted to IROS 2025
☆ MOSU: Autonomous Long-range Robot Navigation with Multi-modal Scene Understanding
We present MOSU, a novel autonomous long-range navigation system that
enhances global navigation for mobile robots through multimodal perception and
on-road scene understanding. MOSU addresses the outdoor robot navigation
challenge by integrating geometric, semantic, and contextual information to
ensure comprehensive scene understanding. The system combines GPS and QGIS
map-based routing for high-level global path planning and multi-modal
trajectory generation for local navigation refinement. For trajectory
generation, MOSU leverages multi-modalities: LiDAR-based geometric data for
precise obstacle avoidance, image-based semantic segmentation for
traversability assessment, and Vision-Language Models (VLMs) to capture social
context and enable the robot to adhere to social norms in complex environments.
This multi-modal integration improves scene understanding and enhances
traversability, allowing the robot to adapt to diverse outdoor conditions. We
evaluate our system in real-world on-road environments and benchmark it on the
GND dataset, achieving a 10% improvement in traversability on navigable
terrains while maintaining a comparable navigation distance to existing global
navigation methods.
☆ DRAE: Dynamic Retrieval-Augmented Expert Networks for Lifelong Learning and Task Adaptation in Robotics ACL 2025
We introduce Dynamic Retrieval-Augmented Expert Networks (DRAE), a
groundbreaking architecture that addresses the challenges of lifelong learning,
catastrophic forgetting, and task adaptation by combining the dynamic routing
capabilities of Mixture-of-Experts (MoE); leveraging the knowledge-enhancement
power of Retrieval-Augmented Generation (RAG); incorporating a novel
hierarchical reinforcement learning (RL) framework; and coordinating through
ReflexNet-SchemaPlanner-HyperOptima (RSHO). DRAE dynamically routes expert
models via a sparse MoE gating mechanism, enabling efficient resource
allocation while leveraging external knowledge through parametric retrieval
(P-RAG) to augment the learning process. We propose a new RL framework with
ReflexNet for low-level task execution, SchemaPlanner for symbolic reasoning,
and HyperOptima for long-term context modeling, ensuring continuous adaptation
and memory retention. Experimental results show that DRAE significantly
outperforms baseline approaches in long-term task retention and knowledge
reuse, achieving an average task success rate of 82.5% across a set of dynamic
robotic manipulation tasks, compared to 74.2% for traditional MoE models.
Furthermore, DRAE maintains an extremely low forgetting rate, outperforming
state-of-the-art methods in catastrophic forgetting mitigation. These results
demonstrate the effectiveness of our approach in enabling flexible, scalable,
and efficient lifelong learning for robotics.
comment: Accepted to the main conference of the Annual Meeting of the
Association for Computational Linguistics (ACL 2025)
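The sparse MoE gating mentioned in the abstract can be illustrated with a generic top-k router. This is a minimal sketch, not DRAE's implementation: the function names `top_k_gate` and `moe_forward`, the toy scalar experts, and the hard-coded gate logits (which a learned router would normally produce) are all assumptions made for illustration.

```python
import math

def top_k_gate(logits, k):
    # Keep the k largest logits and renormalize them with a softmax;
    # every other expert receives weight 0 (this is what makes it sparse).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_forward(x, experts, gate_logits, k=2):
    # Only experts with non-zero gate weight are evaluated.
    weights = top_k_gate(gate_logits, k)
    output = sum(w * experts[i](x) for i, w in enumerate(weights) if w > 0.0)
    return output, weights

# Hypothetical scalar "experts" standing in for expert networks.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x, lambda x: x * x]
output, weights = moe_forward(3.0, experts, gate_logits=[0.1, 2.0, -1.0, 2.0], k=2)
```

With these toy logits, only the second and fourth experts are evaluated, which is the source of the resource savings that sparse routing is meant to provide.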
☆ Bio-Inspired Hybrid Map: Spatial Implicit Local Frames and Topological Map for Mobile Cobot Navigation
Navigation is a fundamental capacity for mobile robots, enabling them to
operate autonomously in complex and dynamic environments. Conventional
approaches use probabilistic models to localize robots and build maps
simultaneously using sensor observations. Recent approaches employ
human-inspired learning, such as imitation and reinforcement learning, to
navigate robots more effectively. However, these methods suffer from high
computational costs, global map inconsistency, and poor generalization to
unseen environments. This paper presents a novel method inspired by how humans
perceive and navigate novel environments effectively.
Specifically, we first build local frames that mimic how humans represent
essential spatial information in the short term. Points in local frames are
hybrid representations, including spatial information and learned features,
so-called spatial-implicit local frames. Then, we integrate spatial-implicit
local frames into the global topological map represented as a factor graph.
Lastly, we develop a novel navigation algorithm based on Rapidly-exploring
Random Tree Star (RRT*) that leverages spatial-implicit local frames and the
topological map to navigate effectively in environments. To validate our
approach, we conduct extensive experiments on real-world datasets and in in-lab
environments. We open our source code at
https://github.com/tuantdang/simn.
☆ PRISM: Pointcloud Reintegrated Inference via Segmentation and Cross-attention for Manipulation
Robust imitation learning for robot manipulation requires comprehensive 3D
perception, yet many existing methods struggle in cluttered environments. Fixed
camera view approaches are vulnerable to perspective changes, and 3D point
cloud techniques often limit themselves to keyframe predictions, reducing
their efficacy in dynamic, contact-intensive tasks. To address these
challenges, we propose PRISM, designed as an end-to-end framework that directly
learns from raw point cloud observations and robot states, eliminating the need
for pretrained models or external datasets. PRISM comprises three main
components: a segmentation embedding unit that partitions the raw point cloud
into distinct object clusters and encodes local geometric details; a
cross-attention component that merges these visual features with processed
robot joint states to highlight relevant targets; and a diffusion module that
translates the fused representation into smooth robot actions. With training on
100 demonstrations per task, PRISM surpasses both 2D and 3D baseline policies
in accuracy and efficiency within our simulated environments, demonstrating
strong robustness in complex, object-dense scenarios. Code and some demos are
available at https://github.com/czknuaa/PRISM.
☆ Learning Robust Stereo Matching in the Wild with Selective Mixture-of-Experts
Yun Wang, Longguang Wang, Chenghao Zhang, Yongjian Zhang, Zhanjie Zhang, Ao Ma, Chenyou Fan, Tin Lun Lam, Junjie Hu
Recently, learning-based stereo matching networks have advanced
significantly. However, they often lack robustness and struggle to achieve
impressive cross-domain performance due to domain shifts and imbalanced
disparity distributions among diverse datasets. Leveraging Vision Foundation
Models (VFMs) can intuitively enhance the model's robustness, but integrating
such models into stereo matching cost-effectively to fully realize their
robustness remains a key challenge. To address this, we propose SMoEStereo, a
novel framework that adapts VFMs for stereo matching through a tailored,
scene-specific fusion of Low-Rank Adaptation (LoRA) and Mixture-of-Experts
(MoE) modules. SMoEStereo introduces MoE-LoRA with adaptive ranks and
MoE-Adapter with adaptive kernel sizes. The former dynamically selects optimal
experts within MoE to adapt to varying scenes across domains, while the latter
injects inductive bias into frozen VFMs to improve geometric feature
extraction. Importantly, to mitigate computational overhead, we further propose
a lightweight decision network that selectively activates MoE modules based on
input complexity, balancing efficiency with accuracy. Extensive experiments
demonstrate that our method exhibits state-of-the-art cross-domain and joint
generalization across multiple benchmarks without dataset-specific adaptation.
The code is available at
https://github.com/cocowy1/SMoE-Stereo.
☆ IDAGC: Adaptive Generalized Human-Robot Collaboration via Human Intent Estimation and Multimodal Policy Learning IROS 2025
In Human-Robot Collaboration (HRC), which encompasses physical interaction
and remote cooperation, accurate estimation of human intentions and seamless
switching of collaboration modes to adjust robot behavior remain paramount
challenges. To address these issues, we propose an Intent-Driven Adaptive
Generalized Collaboration (IDAGC) framework that leverages multimodal data and
human intent estimation to facilitate adaptive policy learning across
multiple tasks in diverse scenarios, thereby enabling autonomous inference of
collaboration modes and dynamic adjustment of robotic actions. This framework
overcomes the limitations of existing HRC methods, which are typically
restricted to a single collaboration mode and lack the capacity to identify and
transition between diverse states. Central to our framework is a predictive
model that captures the interdependencies among vision, language, force, and
robot state data to accurately recognize human intentions with a Conditional
Variational Autoencoder (CVAE) and automatically switch collaboration modes. By
employing dedicated encoders for each modality and integrating extracted
features through a Transformer decoder, the framework efficiently learns
multi-task policies, while force data optimizes compliance control and intent
estimation accuracy during physical interactions. Experiments highlight our
framework's practical potential to advance the comprehensive development of
HRC.
comment: Accepted by IROS 2025
☆ Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions ICML
A long-standing problem in online reinforcement learning (RL) is ensuring
sample efficiency, which stems from an inability to explore environments
efficiently. Most attempts at efficient exploration tackle this problem in a
setting where learning begins from scratch, without prior information available
to bootstrap learning. However, such approaches fail to leverage expert
demonstrations and simulators that can reset to arbitrary states. These
affordances are valuable resources that offer enormous potential to guide
exploration and speed up learning. In this paper, we explore how a small number
of expert demonstrations and a simulator allowing arbitrary resets can
accelerate learning during online RL. We find that training with a suitable
choice of an auxiliary start state distribution that may differ from the true
start state distribution of the underlying Markov Decision Process can
significantly improve sample efficiency. We find that using a notion of safety
to inform the choice of this auxiliary distribution significantly accelerates
learning. By using episode length information as a way to operationalize this
notion, we demonstrate state-of-the-art sample efficiency on a sparse-reward
hard-exploration environment.
comment: ICML ARLET Workshop 2024
☆ DragonFly: Single mmWave Radar 3D Localization of Highly Dynamic Tags in GPS-Denied Environments
The accurate localization and tracking of dynamic targets, such as equipment,
people, vehicles, drones, robots, and the assets that they interact with in
GPS-denied indoor environments is critical to enabling safe and efficient
operations in the next generation of spatially aware industrial facilities.
This paper presents DragonFly, a 3D localization system of highly dynamic
backscatter tags using a single MIMO mmWave radar. The system delivers the
first demonstration of a mmWave backscatter system capable of exploiting the
capabilities of MIMO radars for the 3D localization of mmID tags moving at high
speeds and accelerations at long ranges by introducing a critical Doppler
disambiguation algorithm and a fully integrated cross-polarized dielectric
lens-based mmID tag consuming a mere 68 uW. DragonFly was extensively evaluated
in static and dynamic configurations, including on a flying quadcopter, and
benchmarked against multiple baselines, demonstrating its ability to track the
positions of multiple tags with a median 3D accuracy of 12 cm at speeds and
accelerations on the order of 10 m/s and 4 m/s^2 and at ranges of up to 50 m.
comment: 16 pages including appendix
♻ ☆ Human2LocoMan: Learning Versatile Quadrupedal Manipulation with Human Pretraining
Yaru Niu, Yunzhe Zhang, Mingyang Yu, Changyi Lin, Chenhao Li, Yikai Wang, Yuxiang Yang, Wenhao Yu, Tingnan Zhang, Zhenzhen Li, Jonathan Francis, Bingqing Chen, Jie Tan, Ding Zhao
Quadrupedal robots have demonstrated impressive locomotion capabilities in
complex environments, but equipping them with autonomous versatile manipulation
skills in a scalable way remains a significant challenge. In this work, we
introduce a cross-embodiment imitation learning system for quadrupedal
manipulation, leveraging data collected from both humans and LocoMan, a
quadruped equipped with multiple manipulation modes. Specifically, we develop a
teleoperation and data collection pipeline, which unifies and modularizes the
observation and action spaces of the human and the robot. To effectively
leverage the collected data, we propose an efficient modularized architecture
that supports co-training and pretraining on structured modality-aligned data
across different embodiments. Additionally, we construct the first manipulation
dataset for the LocoMan robot, covering various household tasks in both
unimanual and bimanual modes, supplemented by a corresponding human dataset. We
validate our system on six real-world manipulation tasks, where it achieves an
average success rate improvement of 41.9% overall and 79.7% under
out-of-distribution (OOD) settings compared to the baseline. Pretraining with
human data contributes a 38.6% success rate improvement overall and 82.7% under
OOD settings, enabling consistently better performance with only half the
amount of robot data. Our code, hardware, and data are open-sourced at:
https://human2bots.github.io.
♻ ☆ Occlusion-Aware Consistent Model Predictive Control for Robot Navigation in Occluded Obstacle-Dense Environments
Ensuring safety and motion consistency for robot navigation in occluded,
obstacle-dense environments is a critical challenge. In this context, this
study presents an occlusion-aware Consistent Model Predictive Control (CMPC)
strategy. To account for the occluded obstacles, it incorporates adjustable
risk regions that represent their potential future locations. Subsequently,
dynamic risk boundary constraints are developed online to ensure safety. The
CMPC then constructs multiple locally optimal trajectory branches (each
tailored to different risk regions) to balance between exploitation and
exploration. A shared consensus trunk is generated to ensure smooth transitions
between branches without significant velocity fluctuations, further preserving
motion consistency. To facilitate high computational efficiency and ensure
coordination across local trajectories, we use the alternating direction method
of multipliers (ADMM) to decompose the CMPC into manageable sub-problems for
parallel solving. The proposed strategy is validated through simulation and
real-world experiments on an Ackermann-steering robot platform. The results
demonstrate the effectiveness of the proposed CMPC strategy through comparisons
with baseline approaches in occluded, obstacle-dense environments.
♻ ☆ MMD-OPT : Maximum Mean Discrepancy Based Sample Efficient Collision Risk Minimization for Autonomous Driving
We propose MMD-OPT: a sample-efficient approach for minimizing the risk of
collision under an arbitrary prediction distribution of the dynamic obstacles.
MMD-OPT is based on embedding distributions in a Reproducing Kernel Hilbert
Space (RKHS) and the associated Maximum Mean Discrepancy (MMD). We show how
these two concepts can be used to define a sample-efficient surrogate for
collision risk estimation. We perform extensive simulations to validate the
effectiveness of
MMD-OPT on both synthetic and real-world datasets. Importantly, we show that
trajectory optimization with our MMD-based collision risk surrogate leads to
safer trajectories in low-sample regimes than popular alternatives based on
Conditional Value at Risk (CVaR).
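For readers unfamiliar with MMD, the quantity can be estimated directly from two sets of samples. The sketch below computes a biased (V-statistic) estimate of squared MMD with a Gaussian kernel. It only illustrates the discrepancy measure itself, not the paper's collision-risk surrogate. The kernel choice, the bandwidth, and all function names are assumptions.

```python
import math
import random

def gaussian_kernel(a, b, bandwidth=1.0):
    # RBF kernel between two points given as equal-length tuples.
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth=1.0):
    # Average kernel value over all pairs (x, y).
    total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd_squared(xs, ys, bandwidth=1.0):
    # Biased estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)].
    return (mean_kernel(xs, xs, bandwidth)
            + mean_kernel(ys, ys, bandwidth)
            - 2.0 * mean_kernel(xs, ys, bandwidth))

rng = random.Random(0)
samples = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(100)]
shifted = [(x + 3.0, y + 3.0) for x, y in samples]

mmd_self = mmd_squared(samples, samples)   # identical sample sets
mmd_shift = mmd_squared(samples, shifted)  # clearly separated sample sets
```

The estimate vanishes when both sample sets coincide and grows as the two distributions separate, which is the property a sample-based risk surrogate can build on.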
♻ ☆ Physics Encoded Blocks in Residual Neural Network Architectures for Digital Twin Models
Physics Informed Machine Learning has emerged as a popular approach for
modeling and simulation in digital twins, enabling the generation of accurate
models of processes and behaviors in real-world systems. However, existing
methods either rely on simple loss regularizations that offer limited physics
integration or employ highly specialized architectures that are difficult to
generalize across diverse physical systems. This paper presents a generic
approach based on a novel physics-encoded residual neural network (PERNN)
architecture that seamlessly combines data-driven and physics-based analytical
models to overcome these limitations. Our method integrates differentiable
physics blocks, which implement mathematical operators from physics-based
models, with feed-forward learning blocks, while intermediate residual blocks ensure
stable gradient flow during training. Consequently, the model naturally adheres
to the underlying physical principles even when prior physics knowledge is
incomplete, thereby improving generalizability with low data requirements and
reduced model complexity. We investigate our approach in two application
domains. The first is a steering model for autonomous vehicles in a simulation
environment, and the second is a digital twin for climate modeling using an
ordinary differential equation (ODE)-based model of Net Ecosystem Exchange
(NEE) to enable gap-filling in flux tower data. In both cases, our method
outperforms conventional neural network approaches as well as state-of-the-art
Physics Informed Machine Learning methods.
comment: Accepted at Machine Learning (Springer). Under Publishing Process
♻ ☆ NOVA: Navigation via Object-Centric Visual Autonomy for High-Speed Target Tracking in Unstructured GPS-Denied Environments
Autonomous aerial target tracking in unstructured and GPS-denied environments
remains a fundamental challenge in robotics. Many existing methods rely on
motion capture systems, pre-mapped scenes, or feature-based localization to
ensure safety and control, limiting their deployment in real-world conditions.
We introduce NOVA, a fully onboard, object-centric framework that enables
robust target tracking and collision-aware navigation using only a stereo
camera and an IMU. Rather than constructing a global map or relying on absolute
localization, NOVA formulates perception, estimation, and control entirely in
the target's reference frame. A tightly integrated stack combines a lightweight
object detector with stereo depth completion, followed by histogram-based
filtering to infer robust target distances under occlusion and noise. These
measurements feed a visual-inertial state estimator that recovers the full
6-DoF pose of the robot relative to the target. A nonlinear model predictive
controller (NMPC) plans dynamically feasible trajectories in the target frame.
To ensure safety, high-order control barrier functions are constructed online
from a compact set of high-risk collision points extracted from depth, enabling
real-time obstacle avoidance without maps or dense representations. We validate
NOVA across challenging real-world scenarios, including urban mazes, forest
trails, and repeated transitions through buildings with intermittent GPS loss
and severe lighting changes that disrupt feature-based localization. Each
experiment is repeated multiple times under similar conditions to assess
resilience, showing consistent and reliable performance. NOVA achieves agile
target following at speeds exceeding 50 km/h. These results show that
high-speed vision-based tracking is possible in the wild using only onboard
sensing, with no reliance on external localization or environment assumptions.
♻ ☆ Graph-Assisted Stitching for Offline Hierarchical Reinforcement Learning ICML 2025
Existing offline hierarchical reinforcement learning methods rely on
high-level policy learning to generate subgoal sequences. However, their
efficiency degrades as task horizons increase, and they lack effective
strategies for stitching useful state transitions across different
trajectories. We propose Graph-Assisted Stitching (GAS), a novel framework that
formulates subgoal selection as a graph search problem rather than learning an
explicit high-level policy. By embedding states into a Temporal Distance
Representation (TDR) space, GAS clusters semantically similar states from
different trajectories into unified graph nodes, enabling efficient transition
stitching. A shortest-path algorithm is then applied to select subgoal
sequences within the graph, while a low-level policy learns to reach the
subgoals. To improve graph quality, we introduce the Temporal Efficiency (TE)
metric, which filters out noisy or inefficient transition states, significantly
enhancing task performance. GAS outperforms prior offline HRL methods across
locomotion, navigation, and manipulation tasks. Notably, in the most
stitching-critical task, it achieves a score of 88.3, dramatically surpassing
the previous state-of-the-art score of 1.0. Our source code is available at:
https://github.com/qortmdgh4141/GAS.
comment: ICML 2025
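The idea of selecting subgoals by graph search rather than with a learned high-level policy can be sketched with Dijkstra's algorithm on a small weighted graph. This is an illustrative sketch only: the abstract does not say which shortest-path algorithm GAS uses, and the graph layout, edge costs, and the name `shortest_subgoal_path` are assumptions.

```python
import heapq

def shortest_subgoal_path(graph, start, goal):
    # graph maps each node to {neighbor: edge_cost}; costs must be non-negative.
    dist = {start: 0.0}
    parent = {}
    frontier = [(0.0, start)]
    visited = set()
    while frontier:
        d, node = heapq.heappop(frontier)
        if node in visited:
            continue  # stale queue entry
        visited.add(node)
        if node == goal:
            break
        for nbr, cost in graph.get(node, {}).items():
            new_dist = d + cost
            if new_dist < dist.get(nbr, float("inf")):
                dist[nbr] = new_dist
                parent[nbr] = node
                heapq.heappush(frontier, (new_dist, nbr))
    if goal not in dist:
        return None  # goal unreachable from start
    path = [goal]
    while path[-1] != start:
        path.append(parent[path[-1]])
    return path[::-1]

# Toy graph: nodes stand in for clusters of similar states,
# edge costs for the estimated temporal distance between them.
toy_graph = {
    "s": {"a": 1.0, "b": 4.0},
    "a": {"b": 1.0, "g": 5.0},
    "b": {"g": 1.0},
    "g": {},
}
subgoals = shortest_subgoal_path(toy_graph, "s", "g")
```

The returned node sequence plays the role of the subgoal sequence; a low-level policy would then be asked to reach each node in turn.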
♻ ☆ Event-based Stereo Depth Estimation: A Survey
Stereopsis has widespread appeal in robotics as it is the predominant way by
which living beings perceive depth to navigate our 3D world. Event cameras are
novel bio-inspired sensors that detect per-pixel brightness changes
asynchronously, with very high temporal resolution and high dynamic range,
enabling machine perception in high-speed motion and broad illumination
conditions. The high temporal precision also benefits stereo matching, making
disparity (depth) estimation a popular research area for event cameras ever
since their inception. Over the last 30 years, the field has evolved rapidly,
from low-latency, low-power circuit design to current deep learning (DL)
approaches driven by the computer vision community. The bibliography is vast
and difficult to navigate for non-experts due to its highly interdisciplinary
nature. Past surveys have addressed distinct aspects of this topic, in the
context of applications, or focusing only on a specific class of techniques,
but have overlooked stereo datasets. This survey provides a comprehensive
overview, covering both instantaneous stereo and long-term methods suitable for
simultaneous localization and mapping (SLAM), along with theoretical and
empirical comparisons. It is the first to extensively review DL methods as well
as stereo datasets, even providing practical suggestions for creating new
benchmarks to advance the field. The main advantages and challenges faced by
event-based stereo depth estimation are also discussed. Despite significant
progress, challenges remain in achieving optimal performance in not only
accuracy but also efficiency, a cornerstone of event-based computing. We
identify several gaps and propose future research directions. We hope this
survey inspires future research in this area, by serving as an accessible entry
point for newcomers, as well as a practical guide for seasoned researchers in
the community.
comment: 28 pages, 24 figures, 7 tables. Project page:
https://github.com/tub-rip/EventStereoSurvey
♻ ☆ Event-based Photometric Bundle Adjustment
We tackle the problem of bundle adjustment (i.e., simultaneous refinement of
camera poses and scene map) for a purely rotating event camera. Starting from
first principles, we formulate the problem as a classical non-linear least
squares optimization. The photometric error is defined using the event
generation model directly in the camera rotations and the semi-dense scene
brightness that triggers the events. We leverage the sparsity of event data to
design a tractable Levenberg-Marquardt solver that handles the very large
number of variables involved. To the best of our knowledge, our method, which
we call Event-based Photometric Bundle Adjustment (EPBA), is the first
event-only photometric bundle adjustment method that works on the brightness
map directly and exploits the space-time characteristics of event data, without
having to convert events into image-like representations. Comprehensive
experiments on both synthetic and real-world datasets demonstrate EPBA's
effectiveness in decreasing the photometric error (by up to 90%), yielding
results of unparalleled quality. The refined maps reveal details that were
hidden using prior state-of-the-art rotation-only estimation methods. The
experiments on modern high-resolution event cameras show the applicability of
EPBA to panoramic imaging in various scenarios (without map initialization, at
multiple resolutions, and in combination with other methods, such as IMU dead
reckoning or previous event-based rotation estimation methods). We make the
source code publicly available at https://github.com/tub-rip/epba.
comment: 21 pages, 19 figures, 10 tables. Project page:
https://github.com/tub-rip/epba
♻ ☆ Learning Maximal Safe Sets Using Hypernetworks for MPC-based Local Trajectory Planning in Unknown Environments
This paper presents a novel learning-based approach for online estimation of
maximal safe sets for local trajectory planning in unknown static environments.
The neural representation of a set is used as the terminal set constraint for a
model predictive control (MPC) local planner, resulting in improved recursive
feasibility and safety. To achieve real-time performance and desired
generalization properties, we employ the idea of hypernetworks. We use the
Hamilton-Jacobi (HJ) reachability analysis as the source of supervision during
the training process, allowing us to consider general nonlinear dynamics and
arbitrary constraints. The proposed method is extensively evaluated against
relevant baselines in simulations for different environments and robot
dynamics. The results show an increase in success rate of up to 52% compared to
the best baseline while maintaining comparable execution speed. Additionally,
we deploy our proposed method, NTC-MPC, on a physical robot and demonstrate its
ability to safely avoid obstacles in scenarios where the baselines fail.
♻ ☆ RoboMonkey: Scaling Test-Time Sampling and Verification for Vision-Language-Action Models
Jacky Kwok, Christopher Agia, Rohan Sinha, Matt Foutter, Shulu Li, Ion Stoica, Azalia Mirhoseini, Marco Pavone
Vision-Language-Action (VLA) models have demonstrated remarkable capabilities
in visuomotor control, yet ensuring their robustness in unstructured real-world
environments remains a persistent challenge. In this paper, we investigate
test-time scaling through the lens of sampling and verification as a means to
enhance the robustness and generalization of VLAs. We first demonstrate that
the relationship between action error and the number of generated samples
follows an exponentiated power law across a range of VLAs, indicating the
existence of inference-time scaling laws. Building on these insights, we
introduce RoboMonkey, a test-time scaling framework for VLAs. At deployment,
RoboMonkey samples a small set of actions from a VLA, applies Gaussian
perturbation and majority voting to construct an action proposal distribution,
and then uses a Vision Language Model (VLM)-based verifier to select the
optimal action. We propose a synthetic data generation pipeline for training
such VLM-based action verifiers, and demonstrate that scaling the synthetic
dataset consistently improves verification and downstream accuracy. Through
extensive simulated and hardware experiments, we show that pairing existing
VLAs with RoboMonkey yields significant performance gains, achieving a 25%
absolute improvement on out-of-distribution tasks and 9% on in-distribution
tasks. Additionally, when adapting to new robot setups, we show that
fine-tuning both VLAs and action verifiers yields a 7% performance increase
compared to fine-tuning VLAs alone.
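The sample, perturb, vote, and verify loop described above can be sketched as follows. This is a hedged illustration rather than RoboMonkey's implementation: the action format (a motion vector plus a boolean gripper state), the per-dimension Gaussian fit, and `toy_verifier` (a stand-in for the VLM-based verifier) are assumptions.

```python
import random
from collections import Counter
from statistics import mean, pstdev

def build_proposals(sampled_actions, num_proposals, rng):
    # sampled_actions: list of (motion_vector, gripper_open) pairs from a policy.
    dims = len(sampled_actions[0][0])
    mus = [mean([a[0][i] for a in sampled_actions]) for i in range(dims)]
    sigmas = [pstdev([a[0][i] for a in sampled_actions]) for i in range(dims)]
    # Majority vote on the discrete gripper state.
    gripper = Counter(a[1] for a in sampled_actions).most_common(1)[0][0]
    # Gaussian perturbation: draw new motion vectors around the sample mean.
    return [([rng.gauss(m, s) for m, s in zip(mus, sigmas)], gripper)
            for _ in range(num_proposals)]

def select_action(proposals, verifier):
    # The verifier scores each candidate; the highest-scoring one is executed.
    return max(proposals, key=verifier)

def toy_verifier(action):
    # Hypothetical scorer: prefers motions close to a fixed target.
    motion, _ = action
    target = (0.10, 0.00)
    return -sum((m - t) ** 2 for m, t in zip(motion, target))

rng = random.Random(0)
vla_samples = [((0.08, 0.01), True), ((0.12, -0.02), True),
               ((0.09, 0.03), False), ((0.11, 0.00), True)]
proposals = build_proposals(vla_samples, 16, rng)
best = select_action(proposals, toy_verifier)
```

Only a few policy samples are needed here because the cheap Gaussian resampling, not the policy itself, supplies most of the candidates passed to the verifier.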
♻ ☆ D4orm: Multi-Robot Trajectories with Dynamics-aware Diffusion Denoised Deformations IROS
This work presents an optimization method for generating kinodynamically
feasible and collision-free multi-robot trajectories that exploits an
incremental denoising scheme in diffusion models. Our key insight is that
high-quality trajectories can be discovered merely by denoising noisy
trajectories sampled from a distribution. This approach has no learning
component, relying instead on only two ingredients: a dynamical model of the
robots to obtain feasible trajectories via rollout, and a fitness function to
guide denoising with Monte Carlo gradient approximation. The proposed framework
iteratively optimizes a deformation for the previous trajectory with the
current denoising process, allows anytime refinement as time permits, supports
different dynamics, and benefits from GPU acceleration. Our evaluations for
differential-drive and holonomic teams with up to 16 robots in 2D and 3D worlds
show its ability to discover high-quality solutions faster than other black-box
optimization methods such as MPPI. In a 2D holonomic case with 16 robots, it is
almost twice as fast. As evidence for feasibility, we demonstrate zero-shot
deployment of the planned trajectories on eight multirotors.
comment: Accepted by 2025 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS)
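The two ingredients named in the abstract, a dynamics rollout and a fitness function guiding a sampling-based update, can be illustrated with a generic reward-weighted noise-averaging step. This is one common form of Monte Carlo gradient approximation, not the D4orm algorithm. The single-integrator model, the goal-reaching cost, the temperature, and the noise schedule are all assumptions.

```python
import math
import random

def rollout(controls, dt=0.1):
    # Single-integrator dynamics: position integrates velocity commands.
    x, y = 0.0, 0.0
    for ux, uy in controls:
        x += ux * dt
        y += uy * dt
    return x, y

def cost(controls, goal=(1.0, 1.0)):
    # Fitness: squared distance of the final position to the goal.
    x, y = rollout(controls)
    return (x - goal[0]) ** 2 + (y - goal[1]) ** 2

def denoise_step(controls, sigma, num_samples, temperature, rng):
    # Perturb the control sequence, score every rollout, and move along the
    # cost-weighted average of the sampled noise.
    noises, costs = [], []
    for _ in range(num_samples):
        noise = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in controls]
        candidate = [(u[0] + n[0], u[1] + n[1]) for u, n in zip(controls, noise)]
        noises.append(noise)
        costs.append(cost(candidate))
    best = min(costs)
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    updated = []
    for t, u in enumerate(controls):
        dx = sum(w * n[t][0] for w, n in zip(weights, noises)) / total
        dy = sum(w * n[t][1] for w, n in zip(weights, noises)) / total
        updated.append((u[0] + dx, u[1] + dy))
    return updated

rng = random.Random(0)
controls = [(0.0, 0.0)] * 10
initial_cost = cost(controls)
sigma = 1.0
for _ in range(20):
    controls = denoise_step(controls, sigma, 128, 0.1, rng)
    sigma *= 0.9  # shrink the noise scale each iteration
final_cost = cost(controls)
```

Because every candidate is obtained by rolling out the dynamics, each intermediate result is a feasible trajectory, so the loop can be stopped at any iteration.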