Robotics 54
☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren
We introduce Genie Envisioner (GE), a unified world foundation platform for
robotic manipulation that integrates policy learning, evaluation, and
simulation within a single video-generative framework. At its core, GE-Base is
a large-scale, instruction-conditioned video diffusion model that captures the
spatial, temporal, and semantic dynamics of real-world robotic interactions in
a structured latent space. Built upon this foundation, GE-Act maps latent
representations to executable action trajectories through a lightweight,
flow-matching decoder, enabling precise and generalizable policy inference
across diverse embodiments with minimal supervision. To support scalable
evaluation and training, GE-Sim serves as an action-conditioned neural
simulator, producing high-fidelity rollouts for closed-loop policy development.
The platform is further equipped with EWMBench, a standardized benchmark suite
measuring visual fidelity, physical consistency, and instruction-action
alignment. Together, these components establish Genie Envisioner as a scalable
and practical foundation for instruction-driven, general-purpose embodied
intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
☆ Towards Generalizable Safety in Crowd Navigation via Conformal Uncertainty Handling
Mobile robots navigating in crowds trained using reinforcement learning are
known to suffer performance degradation when faced with out-of-distribution
scenarios. We propose that by properly accounting for the uncertainties of
pedestrians, a robot can learn safe navigation policies that are robust to
distribution shifts. Our method augments agent observations with prediction
uncertainty estimates generated by adaptive conformal inference, and it uses
these estimates to guide the agent's behavior through constrained reinforcement
learning. The system helps regulate the agent's actions and enables it to adapt
to distribution shifts. In the in-distribution setting, our approach achieves a
96.93% success rate, which is over 8.80% higher than the previous
state-of-the-art baselines with over 3.72 times fewer collisions and 2.43 times
fewer intrusions into ground-truth human future trajectories. In three
out-of-distribution scenarios, our method shows much stronger robustness when
facing distribution shifts in velocity variations, policy changes, and
transitions from individual to group dynamics. We deploy our method on a real
robot, and experiments show that the robot makes safe and robust decisions when
interacting with both sparse and dense crowds. Our code and videos are
available on https://gen-safe-nav.github.io/.
comment: 9th Conference on Robot Learning (CoRL 2025); Project website:
https://gen-safe-nav.github.io/. arXiv admin note: text overlap with
arXiv:2407.17460
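As a rough illustration of the adaptive conformal inference step named in the abstract above, the sketch below applies the standard ACI update, alpha_{t+1} = alpha_t + gamma * (alpha_target - miss_t), to a stream of scalar prediction errors. The function name, the step size `gamma`, the warm-up length, and the synthetic data are assumptions made for illustration; they are not taken from the paper or its code.

```python
import numpy as np

def aci_radii(errors, target_alpha=0.1, gamma=0.05, warmup=20):
    """Per-step uncertainty radii from adaptive conformal inference (ACI).

    errors: 1-D array of non-negative prediction errors, e.g. the distance
    between a predicted and an observed pedestrian position (hypothetical).
    The radius at step t is the (1 - alpha_t) empirical quantile of the
    errors observed before t; alpha_t is adapted online.
    """
    errors = np.asarray(errors, dtype=float)
    alpha_t = target_alpha
    radii = np.full(len(errors), np.nan)  # NaN during the warm-up phase
    for t in range(warmup, len(errors)):
        if alpha_t <= 0.0:
            radii[t] = np.inf   # cover everything until alpha_t recovers
        elif alpha_t >= 1.0:
            radii[t] = 0.0      # cover nothing
        else:
            radii[t] = np.quantile(errors[:t], 1.0 - alpha_t)
        miss = 1.0 if errors[t] > radii[t] else 0.0
        # A miss lowers alpha_t (wider regions); a hit raises it slightly.
        alpha_t += gamma * (target_alpha - miss)
    return radii

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    errs = np.abs(rng.normal(size=1000))
    errs[500:] *= 3.0  # simulated distribution shift
    radii = aci_radii(errs)
    print("empirical miss rate:", np.mean(errs[20:] > radii[20:]))
```

In a setup like the one described, radii of this kind could be appended to the agent's observations; how the paper actually does this is not specified in the abstract.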
☆ TrajEvo: Trajectory Prediction Heuristics Design via LLM-driven Evolution
Trajectory prediction is a critical task in modeling human behavior,
especially in safety-critical domains such as social robotics and autonomous
vehicle navigation. Traditional heuristics based on handcrafted rules often
lack accuracy and generalizability. Although deep learning approaches offer
improved performance, they typically suffer from high computational cost,
limited explainability, and, importantly, poor generalization to
out-of-distribution (OOD) scenarios. In this paper, we introduce TrajEvo, a
framework that leverages Large Language Models (LLMs) to automatically design
trajectory prediction heuristics. TrajEvo employs an evolutionary algorithm to
generate and refine prediction heuristics from past trajectory data. We propose
two key innovations: Cross-Generation Elite Sampling to encourage population
diversity, and a Statistics Feedback Loop that enables the LLM to analyze and
improve alternative predictions. Our evaluations demonstrate that TrajEvo
outperforms existing heuristic methods across multiple real-world datasets, and
notably surpasses both heuristic and deep learning methods in generalizing to
an unseen OOD real-world dataset. TrajEvo marks a promising step toward the
automated design of fast, explainable, and generalizable trajectory prediction
heuristics. We release our source code to facilitate future research at
https://github.com/ai4co/trajevo.
comment: arXiv admin note: substantial text overlap with arXiv:2505.04480
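To make the notion of a trajectory prediction heuristic concrete, here is a minimal sketch of one classic handcrafted rule, constant-velocity extrapolation, together with an average displacement error (ADE) score. Both the choice of heuristic and the metric are assumptions for illustration; the abstract does not say which heuristics TrajEvo starts from or evolves, nor which metric it reports.

```python
import numpy as np

def constant_velocity_predict(history, n_future):
    """Extrapolate the last observed step of a 2-D track.

    history: array of shape (T, 2) with T >= 2 past positions.
    Returns an array of shape (n_future, 2).
    """
    history = np.asarray(history, dtype=float)
    velocity = history[-1] - history[-2]          # displacement per time step
    steps = np.arange(1, n_future + 1)[:, None]   # column vector 1..n_future
    return history[-1] + steps * velocity

def average_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and true positions."""
    diff = np.asarray(pred, dtype=float) - np.asarray(truth, dtype=float)
    return float(np.linalg.norm(diff, axis=1).mean())

if __name__ == "__main__":
    past = [[0.0, 0.0], [1.0, 0.5]]
    pred = constant_velocity_predict(past, n_future=3)
    print(pred)
    print(average_displacement_error(pred, pred + np.array([0.0, 1.0])))
```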
☆ Robust adaptive fuzzy sliding mode control for trajectory tracking of cylindrical manipulator
This research proposes a robust adaptive fuzzy sliding mode control (AFSMC)
approach to enhance the trajectory tracking performance of cylindrical robotic
manipulators, extensively utilized in applications such as CNC and 3D printing.
The proposed approach integrates fuzzy logic with sliding mode control (SMC) to
bolster adaptability and robustness, with fuzzy logic approximating the
uncertain dynamics of the system, while SMC ensures strong performance.
Simulation results in MATLAB/Simulink demonstrate that AFSMC significantly
improves trajectory tracking accuracy, stability, and disturbance rejection
compared to traditional methods. This research underscores the effectiveness of
AFSMC in controlling robotic manipulators, contributing to enhanced precision
in industrial robotic applications.
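For readers unfamiliar with sliding mode control, the sketch below simulates a plain boundary-layer SMC law on a single joint with a mismatched mass estimate and a sinusoidal disturbance. It deliberately omits the fuzzy approximation and the adaptation law, so it is not the AFSMC controller of the paper; the plant, gains, and reference trajectory are all assumed values chosen for illustration.

```python
import math

def simulate_smc(t_end=5.0, dt=1e-3, lam=5.0, gain=10.0, phi=0.05):
    """Track q_ref(t) = sin(t) with a 1-DOF joint, m * q'' = u + d(t).

    Sliding surface: s = e' + lam * e, with e = q - q_ref.
    Control: u = m_hat * (q_ref'' - lam * e') - gain * sat(s / phi).
    Returns the list of absolute tracking errors at every step.
    """
    m_true, m_hat = 1.0, 0.8   # true mass vs. (wrong) model estimate
    q, qd = 0.5, 0.0           # initial position and velocity
    errors = []
    for k in range(int(round(t_end / dt))):
        t = k * dt
        q_ref, qd_ref, qdd_ref = math.sin(t), math.cos(t), -math.sin(t)
        e, e_dot = q - q_ref, qd - qd_ref
        s = e_dot + lam * e
        sat = max(-1.0, min(1.0, s / phi))  # boundary layer limits chattering
        u = m_hat * (qdd_ref - lam * e_dot) - gain * sat
        disturbance = 0.5 * math.sin(3.0 * t)
        qdd = (u + disturbance) / m_true
        qd += qdd * dt   # semi-implicit Euler integration
        q += qd * dt
        errors.append(abs(e))
    return errors

if __name__ == "__main__":
    errs = simulate_smc()
    print("initial error:", errs[0])
    print("max error over the last second:", max(errs[-1000:]))
```

The switching gain has to dominate the model mismatch and the disturbance; in this toy setting `gain=10.0` does so with a wide margin.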
☆ CleanUpBench: Embodied Sweeping and Grasping Benchmark
Embodied AI benchmarks have advanced navigation, manipulation, and reasoning,
but most target complex humanoid agents or large-scale simulations that are far
from real-world deployment. In contrast, mobile cleaning robots with dual mode
capabilities, such as sweeping and grasping, are rapidly emerging as realistic
and commercially viable platforms. However, no benchmark currently exists that
systematically evaluates these agents in structured, multi-target cleaning
tasks, revealing a critical gap between academic research and real-world
applications. We introduce CleanUpBench, a reproducible and extensible
benchmark for evaluating embodied agents in realistic indoor cleaning
scenarios. Built on NVIDIA Isaac Sim, CleanUpBench simulates a mobile service
robot equipped with a sweeping mechanism and a six-degree-of-freedom robotic
arm, enabling interaction with heterogeneous objects. The benchmark includes
manually designed environments and one procedurally generated layout to assess
generalization, along with a comprehensive evaluation suite covering task
completion, spatial efficiency, motion quality, and control performance. To
support comparative studies, we provide baseline agents based on heuristic
strategies and map-based planning. CleanUpBench bridges the gap between
low-level skill evaluation and full-scene testing, offering a scalable testbed
for grounded, embodied intelligence in everyday settings.
☆ Mixed-Initiative Dialog for Human-Robot Collaborative Manipulation
Albert Yu, Chengshu Li, Luca Macesanu, Arnav Balaji, Ruchira Ray, Raymond Mooney, Roberto Martín-Martín
Effective robotic systems for long-horizon human-robot collaboration must
adapt to a wide range of human partners, whose physical behavior, willingness
to assist, and understanding of the robot's capabilities may change over time.
This demands a tightly coupled communication loop that grants both agents the
flexibility to propose, accept, or decline requests as they coordinate toward
completing the task effectively. We apply a Mixed-Initiative dialog paradigm to
Collaborative human-roBot teaming and propose MICoBot, a system that handles
the common scenario where both agents, using natural language, take initiative
in formulating, accepting, or rejecting proposals on who can best complete
different steps of a task. To handle diverse, task-directed dialog, and find
successful collaborative strategies that minimize human effort, MICoBot makes
decisions at three levels: (1) a meta-planner considers human dialog to
formulate and code a high-level collaboration strategy, (2) a planner optimally
allocates the remaining steps to either agent based on the robot's capabilities
(measured by a simulation-pretrained affordance model) and the human's
estimated availability to help, and (3) an action executor decides the
low-level actions to perform or words to say to the human. Our extensive
evaluations in simulation and the real world -- on a physical robot with 18 unique
human participants over 27 hours -- demonstrate the ability of our method to
collaborate effectively with diverse human users, yielding significantly
better task success and user experience than a pure LLM baseline and other
agent allocation models. See additional videos and materials at
https://robin-lab.cs.utexas.edu/MicoBot/.
comment: Project website at https://robin-lab.cs.utexas.edu/MicoBot/
☆ Towards Human-Centric Evaluation of Interaction-Aware Automated Vehicle Controllers: A Framework and Case Study
As automated vehicles (AVs) increasingly integrate into mixed-traffic
environments, evaluating their interaction with human-driven vehicles (HDVs)
becomes critical. In most research focused on developing new AV control
algorithms (controllers), the performance of these algorithms is assessed
solely based on performance metrics such as collision avoidance or lane-keeping
efficiency, while largely overlooking the human-centred dimensions of
interaction with HDVs. This paper proposes a structured evaluation framework
that addresses this gap by incorporating metrics grounded in the human-robot
interaction literature. The framework spans four key domains: a) interaction
effect, b) interaction perception, c) interaction effort, and d) interaction
ability. These domains capture both the performance of the AV and its impact on
human drivers around it. To demonstrate the utility of the framework, we apply
it to a case study evaluating how a state-of-the-art AV controller interacts
with human drivers in a merging scenario in a driving simulator. Measuring
HDV-HDV interactions as a baseline, this study included one representative
metric per domain: a) perceived safety, b) subjective ratings, specifically how
participants perceived the other vehicle's driving behaviour (e.g.,
aggressiveness or predictability), c) driver workload, and d) merging success.
The results showed that incorporating metrics covering all four domains in the
evaluation of AV controllers can illuminate critical differences in driver
experience when interacting with AVs. This highlights the need for a more
comprehensive evaluation approach. Our framework offers researchers,
developers, and policymakers a systematic method for assessing AV behaviour
beyond technical performance, fostering the development of AVs that are not
only functionally capable but also understandable, acceptable, and safe from a
human perspective.
☆ Do Robots Really Need Anthropomorphic Hands?
Human manipulation skills represent a pinnacle of human voluntary motor
functions, requiring the coordination of many degrees of freedom and processing
of high-dimensional sensor input to achieve such a high level of dexterity.
Thus, we set out to answer whether the human hand, with its associated
biomechanical properties, sensors, and control mechanisms, is an ideal that we
should strive for in robotics -- do we really need anthropomorphic robotic hands?
This survey can help practitioners to make the trade-off between hand
complexity and potential manipulation skills. We provide an overview of the
human hand, a comparison of commercially available robotic and prosthetic
hands, and a systematic review of hand mechanisms and skills that they are
capable of. This leads to follow-up questions. What is the minimum requirement
for mechanisms and sensors to implement most skills that a robot needs? What is
missing to reach human-level dexterity? Can we improve upon human dexterity?
Although complex five-fingered hands are often used as the ultimate goal for
robotic manipulators, they are not necessary for all tasks. We found that wrist
flexibility and finger abduction/adduction are important for manipulation
capabilities. In contrast, increasing the number of fingers, actuators, or
degrees of freedom is often not necessary. Three fingers are a good compromise
between simplicity and dexterity. Non-anthropomorphic hand designs with two
opposing pairs of fingers or human hands with six fingers can further increase
dexterity, suggesting that the human hand may not be the optimum.
☆ Computational Design and Fabrication of Modular Robots with Untethered Control
Manas Bhargava, Takefumi Hiraki, Malina Strugaru, Michal Piovarci, Chiara Daraio, Daisuke Iwai, Bernd Bickel
Natural organisms use distributed actuation via their musculoskeletal systems
to adapt their gait for traversing diverse terrains or to morph their bodies to
perform varied tasks. A longstanding challenge in the field of robotics is to
mimic this extensive adaptability and range of motion. This has led humans to
develop various soft robotic systems that emulate natural organisms. However,
such systems are generally optimized for a single functionality, lack the
ability to change form or function on demand, or are often tethered to bulky
control systems. To address these challenges, we present our framework for
designing and controlling robots that mimic nature's blueprint by utilizing
distributed actuation. We propose a novel building block that combines
3D-printed bones with liquid crystal elastomer (LCE) muscles as lightweight
actuators and enables the modular assembly of musculoskeletal robots. We
developed LCE rods that contract in response to infrared radiation, thereby
achieving local and untethered control over the distributed network of bones,
which in turn results in global deformation of the robot. Furthermore, to
capitalize on the extensive design space, we develop two computational tools:
one to optimize the robot's skeletal graph, enabling multiple target
deformations, and another to co-optimize the skeletal designs and control gaits
to achieve target locomotion. We validate our system by building several robots
that show complex shape morphing, varying control schemes, and adaptability to
their environment. Our system integrates advances in modular material building,
untethered and distributed control, and computational design to introduce a new
generation of robots that brings us closer to the capabilities of living
organisms.
☆ DistillDrive: End-to-End Multi-Mode Autonomous Driving Distillation by Isomorphic Hetero-Source Planning Model
End-to-end autonomous driving has recently seen rapid development,
exerting a profound influence on both industry and academia. However,
existing work focuses excessively on ego-vehicle status as its sole
learning objective and lacks planning-oriented understanding, which limits
the robustness of the overall decision-making process. In this work, we
introduce DistillDrive, an end-to-end knowledge distillation-based autonomous
driving model that leverages diversified instance imitation to enhance
multi-mode motion feature learning. Specifically, we employ a planning model
based on structured scene representations as the teacher model, leveraging its
diversified planning instances as multi-objective learning targets for the
end-to-end model. Moreover, we incorporate reinforcement learning to enhance
the optimization of state-to-decision mappings, while utilizing generative
modeling to construct planning-oriented instances, fostering intricate
interactions within the latent space. We validate our model on the nuScenes and
NAVSIM datasets, achieving a 50% reduction in collision rate and a 3-point
improvement in closed-loop performance compared to the baseline model. Code and
model are publicly available at https://github.com/YuruiAI/DistillDrive
☆ Real-Time Iteration Scheme for Diffusion Policy
Diffusion Policies have demonstrated impressive performance in robotic
manipulation tasks. However, their long inference time, resulting from an
extensive iterative denoising process, and the need to execute an action chunk
before the next prediction to maintain consistent actions limit their
applicability to latency-critical tasks or simple tasks with a short cycle
time. While recent methods explored distillation or alternative policy
structures to accelerate inference, these often demand additional training,
which can be resource-intensive for large robotic models. In this paper, we
introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme, a
method from optimal control that accelerates optimization by leveraging
solutions from previous time steps as initial guesses for subsequent
iterations. We explore the application of this scheme in diffusion inference
and propose a scaling-based method to effectively handle discrete actions, such
as grasping, in robotic manipulation. The proposed scheme significantly reduces
runtime computational costs without the need for distillation or policy
redesign. This enables a seamless integration into many pre-trained
diffusion-based models, in particular, to resource-demanding large models. We
also provide theoretical conditions for contractivity, which could be useful
for estimating the initial denoising step. Quantitative results from extensive
simulation experiments show a substantial reduction in inference time, with
overall performance comparable to Diffusion Policy using full-step
denoising. Our project page with additional resources is available at:
https://rti-dp.github.io/.
comment: © 2025 IEEE. Personal use of this material is permitted.
Permission from IEEE must be obtained for all other uses, in any current or
future media, including reprinting/republishing this material for advertising
or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of
this work in other works
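The warm-starting idea behind the Real-Time Iteration scheme can be sketched with a toy example: instead of refining an action chunk from fresh noise at every control step, the previous chunk is shifted by one step and refined for the same small number of iterations. The contraction used here as the "denoiser", the sinusoidal target, and all function names are stand-ins assumed for illustration; a real diffusion policy would use a trained network, and the paper's scaling-based handling of discrete actions is not shown.

```python
import numpy as np

def target_chunk(t, horizon):
    """Toy 'ideal' action chunk at control step t (assumed, for illustration)."""
    return np.sin(0.1 * (t + np.arange(horizon)))

def refine(chunk, target, n_iters, rate=0.5):
    """Stand-in for iterative denoising: a simple contraction toward target."""
    for _ in range(n_iters):
        chunk = chunk + rate * (target - chunk)
    return chunk

def run(warm_start, n_control_steps=20, horizon=8, n_iters=3, seed=0):
    """Return the mean absolute error of the chunk at the final control step."""
    rng = np.random.default_rng(seed)
    prev = None
    err = 0.0
    for t in range(n_control_steps):
        target = target_chunk(t, horizon)
        if warm_start and prev is not None:
            # Shift the previous solution by one step and repeat its last action.
            init = np.concatenate([prev[1:], prev[-1:]])
        else:
            init = rng.normal(size=horizon)  # cold start from noise
        prev = refine(init, target, n_iters)
        err = float(np.abs(prev - target).mean())
    return err

if __name__ == "__main__":
    print("cold start error:", run(warm_start=False))
    print("warm start error:", run(warm_start=True))
```

With the same iteration budget, the warm-started variant ends closer to the target because each control step inherits the progress of the previous one.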
☆ Robots can defuse high-intensity conflict situations IROS
This paper investigates the specific scenario of high-intensity
confrontations between humans and robots, to understand how robots can defuse
the conflict. It focuses on the effectiveness of using five different affective
expression modalities as main drivers for defusing the conflict. The aim is to
discover any strengths or weaknesses in using each modality to mitigate the
hostility that people feel towards a poorly performing robot. The defusing of
the situation is accomplished by making the robot better at acknowledging the
conflict and by letting it express remorse. To facilitate the tests, we used a
custom affective robot in a simulated conflict situation with 105 test
participants. The results show that all tested expression modalities can
successfully be used to defuse the situation and convey an acknowledgment of
the confrontation. The ratings were remarkably similar, but the movement
modality was different (ANON p < .05) than the other modalities. The test
participants also had similar affective interpretations of how affected the
robot was by the confrontation across all expression modalities. This indicates
that defusing a high-intensity interaction may not demand special attention to
the expression abilities of the robot, but rather require attention to the
abilities of being socially aware of the situation and reacting in accordance
with it.
comment: 7 pages, 6 figures, 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS) October 25-29, 2020, Las Vegas, NV, USA
☆ A Multi-view Landmark Representation Approach with Application to GNSS-Visual-Inertial Odometry
The Invariant Extended Kalman Filter (IEKF) has been a significant technique in
vision-aided sensor fusion. However, it usually suffers from a high computational
burden when jointly optimizing camera poses and the landmarks. To improve its
efficiency and applicability for multi-sensor fusion, we present a multi-view
pose-only estimation approach with its application to GNSS-Visual-Inertial
Odometry (GVIO) in this paper. Our main contribution is deriving a visual
measurement model which directly associates landmark representation with
multiple camera poses and observations. Such a pose-only measurement is proven
to be tightly coupled between landmarks and poses, and to maintain a perfect null
space that is independent of estimated poses. Finally, we apply the proposed
approach to a filter-based GVIO with a novel feature management strategy. Both
simulation tests and real-world experiments are conducted to demonstrate the
superiority of the proposed method in terms of efficiency and accuracy.
☆ Affecta-Context: The Context-Guided Behavior Adaptation Framework
This paper presents Affecta-context, a general framework to facilitate
behavior adaptation for social robots. The framework uses information about the
physical context to guide its behaviors in human-robot interactions. It
consists of two parts: one that represents encountered contexts and one that
learns to prioritize between behaviors through human-robot interactions. As
physical contexts are encountered, the framework clusters them by their measured
physical properties. In each context, the framework learns to prioritize
between behaviors to optimize the physical attributes of the robot's behavior
in line with its current environment and the preferences of the users it
interacts with. This paper illustrates the abilities of the Affecta-context
framework by enabling a robot to autonomously learn the prioritization of
discrete behaviors. This was achieved by training across 72 interactions in two
different physical contexts with 6 different human test participants. The paper
demonstrates the trained Affecta-context framework by verifying the robot's
ability to generalize over the input and to match its behaviors to a previously
unvisited physical context.
comment: 6 pages, Intelligent Autonomous Systems 18. IAS 2023
☆ Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control
Teaching robots dexterous skills from human videos remains challenging due to
the reliance on low-level trajectory imitation, which fails to generalize
across object types, spatial layouts, and manipulator configurations. We
propose Graph-Fused Vision-Language-Action (GF-VLA), a framework that enables
dual-arm robotic systems to perform task-level reasoning and execution directly
from RGB and Depth human demonstrations. GF-VLA first extracts
Shannon-information-based cues to identify hands and objects with the highest
task relevance, then encodes these cues into temporally ordered scene graphs
that capture both hand-object and object-object interactions. These graphs are
fused with a language-conditioned transformer that generates hierarchical
behavior trees and interpretable Cartesian motion commands. To improve
execution efficiency in bimanual settings, we further introduce a cross-hand
selection policy that infers optimal gripper assignment without explicit
geometric reasoning. We evaluate GF-VLA on four structured dual-arm block
assembly tasks involving symbolic shape construction and spatial
generalization. Experimental results show that the information-theoretic scene
representation achieves over 95 percent graph accuracy and 93 percent subtask
segmentation, supporting the LLM planner in generating reliable and
human-readable task policies. When executed by the dual-arm robot, these
policies yield 94 percent grasp success, 89 percent placement accuracy, and 90
percent overall task success across stacking, letter-building, and geometric
reconfiguration scenarios, demonstrating strong generalization and robustness
across diverse spatial and semantic variations.
comment: Journal under review
☆ ASkDAgger: Active Skill-level Data Aggregation for Interactive Imitation Learning
Human teaching effort is a significant bottleneck for the broader
applicability of interactive imitation learning. To reduce the number of
required queries, existing methods employ active learning to query the human
teacher only in uncertain, risky, or novel situations. However, during these
queries, the novice's planned actions are not utilized despite containing
valuable information, such as the novice's capabilities, as well as
corresponding uncertainty levels. To this end, we allow the novice to say: "I
plan to do this, but I am uncertain." We introduce the Active Skill-level Data
Aggregation (ASkDAgger) framework, which leverages teacher feedback on the
novice plan in three key ways: (1) S-Aware Gating (SAG): Adjusts the gating
threshold to track sensitivity, specificity, or a minimum success rate; (2)
Foresight Interactive Experience Replay (FIER), which recasts valid and
relabeled novice action plans into demonstrations; and (3) Prioritized
Interactive Experience Replay (PIER), which prioritizes replay based on
uncertainty, novice success, and demonstration age. Together, these components
balance query frequency with failure incidence, reduce the number of required
demonstration annotations, improve generalization, and speed up adaptation to
changing domains. We validate the effectiveness of ASkDAgger through
language-conditioned manipulation tasks in both simulation and real-world
environments. Code, data, and videos are available at
https://askdagger.github.io.
comment: Accepted for publication in Transactions on Machine Learning Research
(TMLR, 2025)
☆ GhostShell: Streaming LLM Function Calls for Concurrent Embodied Programming
Jian Gong, Youwei Huang, Bo Yuan, Ming Zhu, Juncheng Zhan, Jinke Wang, Hang Shu, Mingyue Xiong, Yanjun Ye, Yufan Zu, Yang Zhou, Yihan Ding, Xuannian Chen, Xingyu Lu, Runjie Ban, Bingchao Huang, Fusen Liu
We present GhostShell, a novel approach that leverages Large Language Models
(LLMs) to enable streaming and concurrent behavioral programming for embodied
systems. In contrast to conventional methods that rely on pre-scheduled action
sequences or behavior trees, GhostShell drives embodied systems to act
on-the-fly by issuing function calls incrementally as tokens are streamed from
the LLM. GhostShell features a streaming XML function token parser, a dynamic
function interface mapper, and a multi-channel scheduler that orchestrates
intra-channel synchronous and inter-channel asynchronous function calls,
thereby coordinating serial-parallel embodied actions across multiple robotic
components as directed by the LLM. We evaluate GhostShell on our robot
prototype COCO through comprehensive grounded experiments across 34 real-world
interaction tasks and multiple LLMs. The results demonstrate that our approach
achieves a state-of-the-art Behavioral Correctness Metric of 0.85 with Claude-4
Sonnet and up to 66X faster response times compared to LLM native function
calling APIs. GhostShell also proves effective in long-horizon multimodal
tasks, demonstrating strong robustness and generalization.
comment: 17 pages, 5 figures, conference
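The idea of issuing function calls while tokens are still arriving can be sketched with a small incremental parser. The tag format (self-closing XML-like elements such as `<move x="1"/>`), the class name, and the example function names are hypothetical; the abstract does not specify GhostShell's actual token grammar, interface mapper, or scheduler.

```python
import re

# Hypothetical call syntax: <name attr="value" .../>
TAG = re.compile(r'<(\w+)((?:\s+\w+="[^"]*")*)\s*/>')
ATTR = re.compile(r'(\w+)="([^"]*)"')

class StreamingCallParser:
    """Buffer streamed text and emit each function call once its tag closes."""

    def __init__(self):
        self.buffer = ""

    def feed(self, token):
        """Append one streamed token; return the calls completed by it."""
        self.buffer += token
        calls = []
        while True:
            match = TAG.search(self.buffer)
            if match is None:
                break
            name, attr_text = match.group(1), match.group(2)
            calls.append((name, dict(ATTR.findall(attr_text))))
            # Drop everything up to the end of the tag, including any
            # plain text that preceded it (ignored in this sketch).
            self.buffer = self.buffer[match.end():]
        return calls

if __name__ == "__main__":
    parser = StreamingCallParser()
    stream = ['<mo', 've x="1" ', 'y="2"/><sp', 'eak text="hi"', '/>']
    for index, token in enumerate(stream):
        for name, args in parser.feed(token):
            print(f"after token {index}: call {name} with {args}")
```

In a fuller system, each emitted `(name, args)` pair would be looked up in a table of robot functions and handed to a scheduler; here the calls are only printed.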
☆ Towards Embodied Agentic AI: Review and Classification of LLM- and VLM-Driven Robot Autonomy and Interaction
Sahar Salimpour, Lei Fu, Farhad Keramat, Leonardo Militano, Giovanni Toffetti, Harry Edelman, Jorge Peña Queralta
Foundation models, including large language models (LLMs) and vision-language
models (VLMs), have recently enabled novel approaches to robot autonomy and
human-robot interfaces. In parallel, vision-language-action models (VLAs) or
large behavior models (LBMs) are increasing the dexterity and capabilities of
robotic systems. This survey paper focuses on those works advancing towards
agentic applications and architectures. This includes initial efforts exploring
GPT-style interfaces to tooling, as well as more complex systems where AI agents
are coordinators, planners, perception actors, or generalist interfaces. Such
agentic architectures allow robots to reason over natural language
instructions, invoke APIs, plan task sequences, or assist in operations and
diagnostics. In addition to peer-reviewed research, due to the fast-evolving
nature of the field, we highlight and include community-driven projects, ROS
packages, and industrial frameworks that show emerging trends. We propose a
taxonomy for classifying model integration approaches and present a comparative
analysis of the role that agents play in different solutions in today's
literature.
☆ Dancing with a Robot: An Experimental Study of Child-Robot Interaction in a Performative Art Setting
This paper presents an evaluation of 18 children's in-the-wild experiences
with the autonomous robot arm performer NED (Never-Ending Dancer) within the
Thingamabobas installation, showcased across the UK. We detail NED's design,
including costume, behaviour, and human interactions, all integral to the
installation. Our observational analysis revealed three key challenges in
child-robot interactions: 1) Initiating and maintaining engagement, 2) Lack of
robot expressivity and reciprocity, and 3) Unmet expectations. Our findings
show that children are naturally curious, and adept at interacting with a
robotic art performer. However, our observations emphasise the critical need to
optimise human-robot interaction (HRI) systems through careful consideration of
audience's capabilities, perceptions, and expectations, within the performative
arts context, to enable engaging and meaningful experiences, especially for
young audiences.
comment: published by Springer
☆ Learning to See and Act: Task-Aware View Planning for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Weixing Chen, Ziliang Chen, Mingtong Dai, Yongsen Zheng, Lingbo Liu, Guanbin Li, Liang Lin
Recent vision-language-action (VLA) models for multi-task robotic
manipulation commonly rely on static viewpoints and shared visual encoders,
which limit 3D perception and cause task interference, hindering robustness and
generalization. In this work, we propose Task-Aware View Planning (TAVP), a
framework designed to overcome these challenges by integrating active view
planning with task-specific representation learning. TAVP employs an efficient
exploration policy, accelerated by a novel pseudo-environment, to actively
acquire informative views. Furthermore, we introduce a Mixture-of-Experts (MoE)
visual encoder to disentangle features across different tasks, boosting both
representation fidelity and task generalization. By learning to see the world
in a task-aware way, TAVP generates more complete and discriminative visual
representations, demonstrating significantly enhanced action prediction across
a wide array of manipulation challenges. Extensive experiments on RLBench tasks
show that our proposed TAVP model achieves superior performance over
state-of-the-art fixed-view approaches. Visual results and code are provided
at: https://hcplab-sysu.github.io/TAVP.
comment: 7 pages, 9 figures, project page: https://hcplab-sysu.github.io/TAVP
☆ FCBV-Net: Category-Level Robotic Garment Smoothing via Feature-Conditioned Bimanual Value Prediction
Category-level generalization for robotic garment manipulation, such as
bimanual smoothing, remains a significant hurdle due to high dimensionality,
complex dynamics, and intra-category variations. Current approaches often
struggle, either overfitting with concurrently learned visual features for a
specific instance or, despite category-level perceptual generalization, failing
to predict the value of synergistic bimanual actions. We propose the
Feature-Conditioned Bimanual Value Network (FCBV-Net), operating on 3D point
clouds to specifically enhance category-level policy generalization for garment
smoothing. FCBV-Net conditions bimanual action value prediction on pre-trained,
frozen dense geometric features, ensuring robustness to intra-category garment
variations. Trainable downstream components then learn a task-specific policy
using these static features. In simulated GarmentLab experiments with the
CLOTH3D dataset, FCBV-Net demonstrated superior category-level generalization.
It exhibited only an 11.5% efficiency drop (Steps80) on unseen garments
compared to 96.2% for a 2D image-based baseline, and achieved 89% final
coverage, outperforming an 83% coverage from a 3D correspondence-based baseline
that uses identical per-point geometric features but a fixed primitive. These
results highlight that the decoupling of geometric understanding from bimanual
action value learning enables better category-level generalization.
comment: 7 pages, 3 figures, 1 table. Submitted to IEEE Robotics and
Automation Letters
☆ Chemist Eye: A Visual Language Model-Powered System for Safety Monitoring and Robot Decision-Making in Self-Driving Laboratories
Francisco Munguia-Galeano, Zhengxue Zhou, Satheeshkumar Veeramani, Hatem Fakhruldeen, Louis Longley, Rob Clowes, Andrew I. Cooper
The integration of robotics and automation into self-driving laboratories
(SDLs) can introduce safety complexities beyond those that already apply to
conventional research laboratories. Personal protective
equipment (PPE) is an essential requirement for ensuring the safety and
well-being of workers in laboratories, self-driving or otherwise. Fires are
another important risk factor in chemical laboratories. In SDLs, fires that
occur close to mobile robots, which use flammable lithium batteries, could have
increased severity. Here, we present Chemist Eye, a distributed safety
monitoring system designed to enhance situational awareness in SDLs. The system
integrates multiple stations equipped with RGB, depth, and infrared cameras,
designed to monitor incidents in SDLs. Chemist Eye is also designed to spot
workers who may have suffered an accident or medical emergency, lapses in PPE
compliance, and fire hazards. To do this, Chemist Eye uses decision-making
driven by a vision-language model (VLM). Chemist Eye is designed for seamless
integration, enabling real-time communication with robots. Based on the VLM
recommendations, the system attempts to drive mobile robots away from potential
fire locations, exits, or individuals not wearing PPE, and issues audible
warnings where necessary. It also integrates with third-party messaging
platforms to provide instant notifications to lab personnel. We tested Chemist
Eye with real-world data from an SDL equipped with three mobile robots and
found that the spotting of possible safety hazards and decision-making
performances reached 97% and 95%, respectively.
☆ From Canada to Japan: How 10,000 km Affect User Perception in Robot Teleoperation
Siméon Capy, Thomas M. Kwok, Kevin Joseph, Yuichiro Kawasumi, Koichi Nagashima, Tomoya Sasaki, Yue Hu, Eiichi Yoshida
Robot teleoperation (RTo) has emerged as a viable alternative to local
control, particularly when human intervention is still necessary. This research
aims to study the distance effect on user perception in RTo, exploring the
potential of teleoperated robots for older adult care. We propose an evaluation
of non-expert users' perception of long-distance RTo, examining how their
perception changes before and after interaction, as well as comparing it to
that of locally operated robots. We have designed a specific protocol
consisting of multiple questionnaires, along with a dedicated software
architecture using the Robotics Operating System (ROS) and Unity. The results
revealed no statistically significant differences between the local and remote
robot conditions, suggesting that remotely teleoperated robots may be a viable
alternative to traditional local control.
comment: Author preprint - Accepted for Humanoids 2025
☆ Examining the legibility of humanoid robot arm movements in a pointing task
Andrej Lúčny, Matilde Antonj, Carlo Mazzola, Hana Hornáčková, Ana Farić, Kristína Malinovská, Michal Vavrecka, Igor Farkaš
Human-robot interaction requires robots whose actions are legible, allowing
humans to interpret, predict, and feel safe around them. This study
investigates the legibility of humanoid robot arm movements in a pointing task,
aiming to understand how humans predict robot intentions from truncated
movements and bodily cues. We designed an experiment using the NICO humanoid
robot, where participants observed its arm movements towards targets on a
touchscreen. Robot cues varied across conditions: gaze, pointing, and pointing
with congruent or incongruent gaze. Arm trajectories were stopped at 60% or
80% of their full length, and participants predicted the final target. We
tested the multimodal superiority and ocular primacy hypotheses, both of which
were supported by the experiment.
comment: Published at ICSR 2025
☆ Analyzing the Impact of Multimodal Perception on Sample Complexity and Optimization Landscapes in Imitation Learning
This paper examines the theoretical foundations of multimodal imitation
learning through the lens of statistical learning theory. We analyze how
multimodal perception (RGB-D, proprioception, language) affects sample
complexity and optimization landscapes in imitation policies. Building on
recent advances in multimodal learning theory, we show that properly integrated
multimodal policies can achieve tighter generalization bounds and more
favorable optimization landscapes than their unimodal counterparts. We provide
a comprehensive review of theoretical frameworks that explain why multimodal
architectures like PerAct and CLIPort achieve superior performance, connecting
these empirical results to fundamental concepts in Rademacher complexity, PAC
learning, and information theory.
comment: 9 pages, 1 figure, 1 table, theoretical analysis with empirical
validation on PerAct implementation in MuJoCo simulation environment
☆ A Vision-Based Collision Sensing Method for Stable Circular Object Grasping with A Soft Gripper System
External collisions to robot actuators typically pose risks to grasping
circular objects. This work presents a vision-based sensing module capable of
detecting collisions to maintain stable grasping with a soft gripper system.
The system employs an eye-in-palm camera with a broad field of view to
simultaneously monitor the motion of fingers and the grasped object.
Furthermore, we have developed a collision-rich grasping strategy to ensure the
stability and security of the entire dynamic grasping process. A physical soft
gripper was manufactured and affixed to a collaborative robotic arm to evaluate
the performance of the collision detection mechanism. An experiment testing
the response time of the mechanism confirmed that the system can react to
collisions instantaneously. A dodging test was conducted to demonstrate that
the gripper can precisely detect the direction and scale of external
collisions.
☆ Benchmarking Shortcutting Techniques for Multi-Robot-Arm Motion Planning IROS 2025
Generating high-quality motion plans for multiple robot arms is challenging
due to the high dimensionality of the system and the potential for inter-arm
collisions. Traditional motion planning methods often produce motions that are
suboptimal in terms of smoothness and execution time for multi-arm systems.
Post-processing via shortcutting is a common approach to improve motion quality
for efficient and smooth execution. However, in multi-arm scenarios, optimizing
one arm's motion must not introduce collisions with other arms. Although
existing multi-arm planning works often use some form of shortcutting
techniques, their exact methodology and impact on performance are often vaguely
described. In this work, we present a comprehensive study quantitatively
comparing existing shortcutting methods for multi-arm trajectories across
diverse simulated scenarios. We carefully analyze the pros and cons of each
shortcutting method and propose two simple strategies for combining these
methods to achieve the best performance-runtime tradeoff. Video, code, and
dataset are available at https://philip-huang.github.io/mr-shortcut/.
comment: 9 pages, 6 figures, accepted for publication at 2025 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS 2025)
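As a rough illustration of what "shortcutting" means in this entry, the sketch below applies random shortcutting to a single piecewise-linear path: pick two waypoints and, if the straight segment between them is collision-free, drop everything in between. The `segment_is_free` callback, the 2D disc obstacle, and all function names are hypothetical and chosen only for illustration. This is not one of the benchmarked methods; a multi-arm version would additionally have to check each shortened segment against the other arms' trajectories.

```python
import math
import random

def path_length(path):
    """Total Euclidean length of a piecewise-linear path."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))

def shortcut(path, segment_is_free, iterations=200, seed=0):
    """Random shortcutting: try to replace a sub-path by one straight segment."""
    rng = random.Random(seed)
    path = list(path)
    for _ in range(iterations):
        if len(path) < 3:
            break
        i, j = sorted(rng.sample(range(len(path)), 2))
        if j - i < 2:
            continue  # adjacent waypoints: nothing to remove
        if segment_is_free(path[i], path[j]):
            # Keep waypoints up to i and from j onward; drop the detour.
            path = path[: i + 1] + path[j:]
    return path

def make_disc_checker(center, radius, samples=50):
    """Hypothetical checker: sample the segment and reject it if any
    sample falls inside a single disc obstacle."""
    def segment_is_free(a, b):
        for k in range(samples + 1):
            t = k / samples
            p = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
            if math.dist(p, center) <= radius:
                return False
        return True
    return segment_is_free

checker = make_disc_checker(center=(5.0, 0.0), radius=1.0)
raw_path = [(0.0, 0.0), (1.0, 3.0), (3.0, 2.0), (5.0, 3.0),
            (7.0, 2.0), (9.0, 3.0), (10.0, 0.0)]
short_path = shortcut(raw_path, checker)
```

Because every accepted shortcut replaces a sub-path by the straight segment between its endpoints, the path length can only stay the same or decrease, while the start and goal are preserved.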
☆ MAG-Nav: Language-Driven Object Navigation Leveraging Memory-Reserved Active Grounding
Visual navigation in unknown environments based solely on natural language
descriptions is a key capability for intelligent robots. In this work, we
propose a navigation framework built upon off-the-shelf Visual Language Models
(VLMs), enhanced with two human-inspired mechanisms: perspective-based active
grounding, which dynamically adjusts the robot's viewpoint for improved visual
inspection, and historical memory backtracking, which enables the system to
retain and re-evaluate uncertain observations over time. Unlike existing
approaches that passively rely on incidental visual inputs, our method actively
optimizes perception and leverages memory to resolve ambiguity, significantly
improving vision-language grounding in complex, unseen environments. Our
framework operates in a zero-shot manner, achieving strong generalization to
diverse and open-ended language descriptions without requiring labeled data or
model fine-tuning. Experimental results on Habitat-Matterport 3D (HM3D) show
that our method outperforms state-of-the-art approaches in language-driven
object navigation. We further demonstrate its practicality through real-world
deployment on a quadruped robot, achieving robust and effective navigation
performance.
☆ Hierarchical Deep Deterministic Policy Gradient for Autonomous Maze Navigation of Mobile Robots
Maze navigation is a fundamental challenge in robotics, requiring agents to
traverse complex environments efficiently. While the Deep Deterministic Policy
Gradient (DDPG) algorithm excels in control tasks, its performance in maze
navigation suffers from sparse rewards, inefficient exploration, and
long-horizon planning difficulties, often leading to low success rates and
average rewards, sometimes even failing to achieve effective navigation. To
address these limitations, this paper proposes an efficient Hierarchical DDPG
(HDDPG) algorithm, which includes high-level and low-level policies. The
high-level policy employs an advanced DDPG framework to generate intermediate
subgoals from a long-term perspective and on a higher temporal scale. The
low-level policy, also powered by the improved DDPG algorithm, generates
primitive actions by observing current states and following the subgoal
assigned by the high-level policy. The proposed method enhances stability with
off-policy correction, refining subgoal assignments by relabeling historical
experiences. Additionally, adaptive parameter space noise is utilized to
improve exploration, and a reshaped intrinsic-extrinsic reward function is
employed to boost learning efficiency. Further optimizations, including
gradient clipping and Xavier initialization, are employed to improve
robustness. The proposed algorithm is rigorously evaluated through numerical
simulation experiments executed using the Robot Operating System (ROS) and
Gazebo. Across three distinct final targets in autonomous maze
navigation tasks, HDDPG significantly overcomes the limitations of standard
DDPG and its variants, improving the success rate by at least 56.59% and
boosting the average reward by a minimum of 519.03 compared to baseline
algorithms.
☆ Optimal Planning for Multi-Robot Simultaneous Area and Line Coverage Using Hierarchical Cyclic Merging Regulation
The double coverage problem focuses on determining efficient, collision-free
routes for multiple robots to simultaneously cover linear features (e.g.,
surface cracks or road routes) and survey areas (e.g., parking lots or local
regions) in known environments. In these problems, each robot carries two
functional roles: service (linear feature footprint coverage) and exploration
(complete area coverage). Service has a smaller operational footprint but
incurs higher costs (e.g., time) compared to exploration. We present optimal
planning algorithms for the double coverage problems using hierarchical cyclic
merging regulation (HCMR). To reduce the complexity for optimal planning
solutions, we analyze the manifold attachment process during graph traversal
from a Morse theory perspective. We show that solutions satisfying minimum path
length and collision-free constraints must belong to a Morse-bounded
collection. To identify this collection, we introduce the HCMR algorithm. In
HCMR, cyclic merging search regulates traversal behavior, while edge sequence
back propagation converts these regulations into graph edge traversal
sequences. Incorporating balanced partitioning, the optimal sequence is
selected to generate routes for each robot. We prove the optimality of the HCMR
algorithm under a fixed sweep direction. The multi-robot simulation results
demonstrate that the HCMR algorithm significantly improves planned path length
by at least 10.0%, reduces task time by at least 16.9% on average, and ensures
conflict-free operation compared to other state-of-the-art planning methods.
♻ ☆ Diffusion Beats Autoregressive in Data-Constrained Settings
Autoregressive (AR) models have long dominated the landscape of large
language models, driving progress across a wide range of tasks. Recently,
diffusion-based language models have emerged as a promising alternative, though
their advantages over AR models remain underexplored. In this paper, we
systematically study masked diffusion models in data-constrained settings (where
training involves repeated passes over limited data) and find that they
significantly outperform AR models when compute is abundant but data is scarce.
Diffusion models make better use of repeated data, achieving lower validation
loss and superior downstream performance. We interpret this advantage as
implicit data augmentation: masked diffusion exposes the model to a diverse
distribution of token orderings and prediction tasks, unlike AR's fixed
left-to-right factorization. We find new scaling laws for diffusion models and
derive a closed-form expression for the critical compute threshold at which
diffusion begins to outperform AR. These results suggest that when data, not
compute, is the bottleneck, diffusion models offer a compelling alternative to
the standard AR paradigm. Our code is available at:
https://diffusion-scaling.github.io.
comment: Project Webpage: https://diffusion-scaling.github.io
♻ ☆ Bayesian Optimization applied for accelerated Virtual Validation of the Autonomous Driving Function
Satyesh Shanker Awasthi, Mohammed Irshadh Ismaaeel Sathyamangalam Imran, Stefano Arrigoni, Francesco Braghin
Rigorous Verification and Validation (V&V) of Autonomous Driving Functions
(ADFs) is paramount for ensuring the safety and public acceptance of Autonomous
Vehicles (AVs). Current validation relies heavily on simulation to achieve
sufficient test coverage within the Operational Design Domain (ODD) of a
vehicle, but exhaustively exploring the vast parameter space of possible
scenarios is computationally expensive and time-consuming. This work introduces
a framework based on Bayesian Optimization (BO) to accelerate the discovery of
critical scenarios. We demonstrate the effectiveness of the framework on a
Model Predictive Controller (MPC)-based motion planner, showing that it
identifies hazardous situations, such as off-road events, using orders of
magnitude fewer simulations than brute-force Design of Experiments (DoE)
methods. Furthermore, this study investigates the scalability of the framework
in higher-dimensional parameter spaces and its ability to identify multiple,
distinct critical regions within the ODD of the motion planner used as the case
study.
comment: 12 pages, corrected author list of references 27 and 38, removed
duplicate reference of reference 6
♻ ☆ Di-NeRF: Distributed NeRF for Collaborative Learning with Relative Pose Refinement
Collaborative mapping of unknown environments can be done faster and more
robustly than mapping by a single robot. However, a collaborative approach requires a
distributed paradigm to be scalable and deal with communication issues. This
work presents a fully distributed algorithm enabling a group of robots to
collectively optimize the parameters of a Neural Radiance Field (NeRF). The
algorithm involves the communication of each robot's trained NeRF parameters
over a mesh network, where each robot trains its NeRF and has access to its own
visual data only. Additionally, the relative poses of all robots are jointly
optimized alongside the model parameters, enabling mapping with less accurate
relative camera poses. We show that multi-robot systems can benefit from
differentiable and robust 3D reconstruction optimized from multiple NeRFs.
Experiments on real-world and synthetic data demonstrate the efficiency of the
proposed algorithm. See the website of the project for videos of the
experiments and supplementary material
(https://sites.google.com/view/di-nerf/home).
comment: 9 pages, 11 figures, Accepted in IEEE-RA-L
♻ ☆ Fast and Robust Visuomotor Riemannian Flow Matching Policy
Diffusion-based visuomotor policies excel at learning complex robotic tasks
by effectively combining visual data with high-dimensional, multi-modal action
distributions. However, diffusion models often suffer from slow inference due
to costly denoising processes or require complex sequential training arising
from recent distilling approaches. This paper introduces Riemannian Flow
Matching Policy (RFMP), a model that inherits the easy training and fast
inference capabilities of flow matching (FM). Moreover, RFMP inherently
incorporates geometric constraints commonly found in realistic robotic
applications, as the robot state resides on a Riemannian manifold. To enhance
the robustness of RFMP, we propose Stable RFMP (SRFMP), which leverages
LaSalle's invariance principle to equip the dynamics of FM with stability to
the support of a target Riemannian distribution. Rigorous evaluation on ten
simulated and real-world tasks shows that RFMP successfully learns and
synthesizes complex sensorimotor policies on Euclidean and Riemannian spaces
with efficient training and inference phases, outperforming Diffusion Policies
and Consistency Policies.
comment: Accepted for publication in IEEE T-RO. Project website:
https://sites.google.com/view/rfmp ; 17 pages, 12 figures, 12 tables
♻ ☆ Reality Fusion: Robust Real-time Immersive Mobile Robot Teleoperation with Volumetric Visual Data Fusion IROS 2024
We introduce Reality Fusion, a novel robot teleoperation system that
localizes, streams, projects, and merges a typical onboard depth sensor with a
photorealistic, high resolution, high framerate, and wide field of view (FoV)
rendering of the complex remote environment represented as 3D Gaussian splats
(3DGS). Our framework enables robust egocentric and exocentric robot
teleoperation in immersive VR, with the 3DGS effectively extending spatial
information of a depth sensor with limited FoV and balancing the trade-off
between data streaming costs and data visual quality. We evaluated our
framework through a user study with 24 participants, which revealed that
Reality Fusion leads to significantly better user performance, situation
awareness, and user preferences. To support further research and development,
we provide an open-source implementation with an easy-to-replicate custom-made
telepresence robot, a high-performance virtual reality 3DGS renderer, and an
immersive robot control package. (Source code:
https://github.com/uhhhci/RealityFusion)
comment: Accepted at IROS 2024
♻ ☆ Motion Planning Diffusion: Learning and Adapting Robot Motion Planning with Diffusion Models
The performance of optimization-based robot motion planning algorithms is
highly dependent on the initial solutions, commonly obtained by running a
sampling-based planner to obtain a collision-free path. However, these methods
can be slow in high-dimensional and complex scenes and produce non-smooth
solutions. Given previously solved path-planning problems, it is highly
desirable to learn their distribution and use it as a prior for new similar
problems. Several works propose utilizing this prior to bootstrap the motion
planning problem, either by sampling initial solutions from it, or using its
distribution in a maximum-a-posteriori formulation for trajectory optimization.
In this work, we introduce Motion Planning Diffusion (MPD), an algorithm that
learns trajectory distribution priors with diffusion models. These generative
models have shown increasing success in encoding multimodal data and have
desirable properties for gradient-based motion planning, such as cost guidance.
Given a motion planning problem, we construct a cost function and sample from
the posterior distribution using the learned prior combined with the cost
function gradients during the denoising process. Instead of learning the prior
on all trajectory waypoints, we propose learning a lower-dimensional
representation of a trajectory using linear motion primitives, particularly
B-spline curves. This parametrization guarantees that the generated trajectory
is smooth, can be interpolated at higher frequencies, and needs fewer
parameters than a dense waypoint representation. We demonstrate the results of
our method ranging from simple 2D to more complex tasks using a 7-dof robot arm
manipulator. In addition to learning from simulated data, we also use human
demonstrations on a real-world pick-and-place task.
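To make the B-spline parametrization mentioned in this entry concrete, the following sketch evaluates a clamped cubic B-spline from a handful of control points using the Cox-de Boor recursion, then samples the same curve at two different resolutions. The degree, the clamped uniform knot vector, the 2D control points, and all function names are assumptions made for illustration; this is not the MPD implementation.

```python
def clamped_knots(n_ctrl, degree):
    """Clamped uniform knot vector on [0, 1] for n_ctrl control points."""
    n_inner = n_ctrl - degree - 1
    inner = [(k + 1) / (n_inner + 1) for k in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)

def basis(i, p, t, knots):
    """Cox-de Boor recursion for the i-th basis function of degree p."""
    if p == 0:
        # Half-open spans, with the last non-empty span closed at t = 1.
        if knots[i] <= t < knots[i + 1]:
            return 1.0
        if t == knots[-1] and knots[i] < knots[i + 1] == knots[-1]:
            return 1.0
        return 0.0
    left = right = 0.0
    if knots[i + p] > knots[i]:
        left = ((t - knots[i]) / (knots[i + p] - knots[i])
                * basis(i, p - 1, t, knots))
    if knots[i + p + 1] > knots[i + 1]:
        right = ((knots[i + p + 1] - t) / (knots[i + p + 1] - knots[i + 1])
                 * basis(i + 1, p - 1, t, knots))
    return left + right

def bspline_point(ctrl, t, degree=3):
    """Point on the curve at parameter t in [0, 1]."""
    knots = clamped_knots(len(ctrl), degree)
    weights = [basis(i, degree, t, knots) for i in range(len(ctrl))]
    dim = len(ctrl[0])
    return tuple(sum(w * c[d] for w, c in zip(weights, ctrl)) for d in range(dim))

def sample_trajectory(ctrl, n_samples, degree=3):
    """Sample the curve at n_samples evenly spaced parameter values."""
    return [bspline_point(ctrl, k / (n_samples - 1), degree)
            for k in range(n_samples)]

# Six control points (12 numbers) describe the whole 2D trajectory.
control_points = [(0.0, 0.0), (1.0, 2.0), (2.0, -1.0),
                  (3.0, 2.0), (4.0, 0.0), (5.0, 1.0)]
coarse = sample_trajectory(control_points, 10)
dense = sample_trajectory(control_points, 200)
```

The same six control points yield both the 10-sample and the 200-sample trajectory, which is the sense in which such a representation can be interpolated at higher frequencies while needing fewer parameters than dense waypoints.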
♻ ☆ Quaternion-Based Sliding Mode Control for Six Degrees of Freedom Flight Control of Quadrotors
Despite extensive research on sliding mode control (SMC) design for
quadrotors, the existing approaches suffer from certain limitations. Euler
angle-based SMC formulations suffer from poor performance in high-pitch or
-roll maneuvers. Quaternion-based SMC approaches have unwinding issues and
complex architecture. Coordinate-free methods are slow and only almost globally
stable. This paper presents a new six degrees of freedom SMC flight controller
to address the above limitations. We use a cascaded architecture with a
position controller in the outer loop and a quaternion-based attitude
controller in the inner loop. The position controller generates the desired
trajectory for the attitude controller using a coordinate-free approach. The
quaternion-based attitude controller uses the natural characteristics of the
quaternion hypersphere, featuring a simple structure while providing global
stability and avoiding unwinding issues. We compare our controller with three
other common control methods conducting challenging maneuvers like flip-over
and high-speed trajectory tracking in the presence of model uncertainties and
disturbances. Our controller consistently outperforms the benchmark approaches
with less control effort and actuator saturation, offering highly effective and
efficient flight control.
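The unwinding issue mentioned in this entry stems from the fact that a unit quaternion q and its negative -q describe the same attitude, so a controller that treats them as different targets may rotate the long way around. The sketch below shows one common textbook remedy, flipping the sign of the error quaternion when its scalar part is negative, so that the error always corresponds to the shortest rotation. It is an illustration under that assumption only and is not claimed to be the controller proposed in the paper; the `(w, x, y, z)` convention and all function names are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def quat_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def attitude_error(q_desired, q_current):
    """Error quaternion conj(q_desired) * q_current, sign-flipped so that
    its scalar part is non-negative (shortest rotation)."""
    e = quat_mul(quat_conj(q_desired), q_current)
    if e[0] < 0.0:
        e = tuple(-c for c in e)
    return e

def rotation_angle(q):
    """Rotation angle in radians encoded by a unit quaternion."""
    w = max(-1.0, min(1.0, q[0]))
    return 2.0 * math.acos(w)

def quat_about_z(angle):
    """Unit quaternion for a rotation of `angle` radians about the z axis."""
    return (math.cos(angle / 2.0), 0.0, 0.0, math.sin(angle / 2.0))

q_d = (1.0, 0.0, 0.0, 0.0)                # desired attitude: identity
q = quat_about_z(math.radians(350.0))     # current attitude: 350 deg about z
err = attitude_error(q_d, q)
err_angle_deg = math.degrees(rotation_angle(err))
```

Without the sign flip, the error quaternion here would encode a 350-degree rotation; with it, the error encodes the equivalent 10-degree rotation in the opposite direction.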
♻ ☆ WeatherEdit: Controllable Weather Editing with 4D Gaussian Field
In this work, we present WeatherEdit, a novel weather editing pipeline for
generating realistic weather effects with controllable types and severity in 3D
scenes. Our approach is structured into two key components: weather background
editing and weather particle construction. For weather background editing, we
introduce an all-in-one adapter that integrates multiple weather styles into a
single pretrained diffusion model, enabling the generation of diverse weather
effects in 2D image backgrounds. During inference, we design a Temporal-View
(TV-) attention mechanism that follows a specific order to aggregate temporal
and spatial information, ensuring consistent editing across multi-frame and
multi-view images. To construct the weather particles, we first reconstruct a
3D scene using the edited images and then introduce a dynamic 4D Gaussian field
to generate snowflakes, raindrops and fog in the scene. The attributes and
dynamics of these particles are precisely controlled through physics-based
modelling and simulation, ensuring realistic weather representation and
flexible severity adjustments. Finally, we integrate the 4D Gaussian field with
the 3D scene to render consistent and highly realistic weather effects.
Experiments on multiple driving datasets demonstrate that WeatherEdit can
generate diverse weather effects with controllable condition severity,
highlighting its potential for autonomous driving simulation in adverse
weather. See project page: https://jumponthemoon.github.io/w-edit
♻ ☆ Vector Quantized-Elites: Unsupervised and Problem-Agnostic Quality-Diversity Optimization
Quality-Diversity algorithms have transformed optimization by prioritizing
the discovery of diverse, high-performing solutions over a single optimal
result. However, traditional Quality-Diversity methods, such as MAP-Elites,
rely heavily on predefined behavior descriptors and complete prior knowledge of
the task to define the behavior space grid, limiting their flexibility and
applicability. In this work, we introduce Vector Quantized-Elites (VQ-Elites),
a novel Quality-Diversity algorithm that autonomously constructs a structured
behavior space grid using unsupervised learning, eliminating the need for prior
task-specific knowledge. At the core of VQ-Elites is the integration of Vector
Quantized Variational Autoencoders, which enables the dynamic learning of
behavior descriptors and the generation of a structured, rather than
unstructured, behavior space grid -- a significant advancement over existing
unsupervised Quality-Diversity approaches. This design establishes VQ-Elites as
a flexible, robust, and task-agnostic optimization framework. To further
enhance the performance of unsupervised Quality-Diversity algorithms, we
introduce behavior space bounding and cooperation mechanisms, which
significantly improve convergence and performance, as well as the Effective
Diversity Ratio and Coverage Diversity Score, two novel metrics that quantify
the actual diversity in the unsupervised setting. We validate VQ-Elites on
robotic arm pose-reaching, mobile robot space-covering, and MiniGrid
exploration tasks. The results demonstrate its ability to efficiently generate
diverse, high-quality solutions, emphasizing its adaptability, scalability,
robustness to hyperparameters, and potential to extend Quality-Diversity
optimization to complex, previously inaccessible domains.
comment: 15 pages, 13 figures, 1 algorithm, 1 table
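The idea of a quantized behavior space can be sketched in a few lines. In the snippet below, a behavior descriptor is assigned to its nearest codebook vector, and the archive keeps only the best-fitness solution per code, in the style of MAP-Elites. The codebook here is a fixed hand-written stand-in (in VQ-Elites it is learned by a Vector Quantized Variational Autoencoder), and the function names, descriptors, and fitness values are hypothetical.

```python
import math

def nearest_code(descriptor, codebook):
    """Index of the codebook vector closest (Euclidean) to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(descriptor, codebook[k]))

def try_insert(archive, codebook, solution, descriptor, fitness):
    """Store the solution if its cell is empty or it beats the current elite."""
    cell = nearest_code(descriptor, codebook)
    if cell not in archive or fitness > archive[cell][1]:
        archive[cell] = (solution, fitness)
        return True
    return False

# Stand-in for learned codes; each code defines one cell of the behavior grid.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
archive = {}
try_insert(archive, codebook, "a", (0.1, 0.1), 1.0)
try_insert(archive, codebook, "b", (0.2, 0.0), 3.0)  # same cell as "a", fitter
try_insert(archive, codebook, "c", (0.9, 0.8), 2.0)  # lands in a different cell
try_insert(archive, codebook, "d", (0.1, 0.2), 0.5)  # same cell as "b", rejected
```

After these four insertions the archive holds two elites, one per occupied cell, which is how diversity (distinct cells) and quality (best fitness per cell) are tracked together.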
♻ ☆ "Set It Up": Functional Object Arrangement with Compositional Generative Models (Journal Version)
Functional object arrangement (FORM) is the task of arranging objects to
fulfill a function, e.g., "set up a dining table for two". One key challenge
here is that the instructions for FORM are often under-specified and do not
explicitly specify the desired object goal poses. This paper presents SetItUp,
a neuro-symbolic framework that learns to specify the goal poses of objects
from a few training examples and a structured natural-language task
specification. SetItUp uses a grounding graph, which is composed of abstract
spatial relations among objects (e.g., left-of), as its intermediate
representation. This decomposes the FORM problem into two stages: (i)
predicting this graph among objects and (ii) predicting object poses given the
grounding graph. For (i), SetItUp leverages large language models (LLMs) to
induce Python programs from a task specification and a few training examples.
This program can be executed to generate grounding graphs in novel scenarios.
For (ii), SetItUp pre-trains a collection of diffusion models to capture
primitive spatial relations and composes these models online to predict object
poses based on the grounding graph. We evaluated SetItUp on a dataset spanning
three distinct task families: arranging tableware on a dining table, organizing
items on a bookshelf, and laying out furniture in a bedroom. Experiments show
that SetItUp outperforms existing models in generating functional, physically
feasible, and aesthetically pleasing object arrangements. This article extends
our conference paper published at Robotics: Science and Systems (RSS) 2024.
comment: This is the journal version accepted to the International Journal of
Robotics Research (IJRR). It extends our prior work presented at Robotics:
Science and Systems (RSS) 2024, with a new compositional program induction
pipeline from natural language, and expanded evaluations on personalized
bookshelf and bedroom furniture layout tasks
♻ ☆ RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Lifelong Learning in Physical Embodied Systems
Mingcong Lei, Honghao Cai, Binbin Que, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
We present RoboMemory, a brain-inspired multi-memory framework for lifelong
learning in physical embodied systems, addressing critical challenges in
real-world environments: continuous learning, multi-module memory latency, task
correlation capture, and infinite-loop mitigation in closed-loop planning.
Grounded in cognitive neuroscience, it integrates four core modules: the
Information Preprocessor (thalamus-like), the Lifelong Embodied Memory System
(hippocampus-like), the Closed-Loop Planning Module (prefrontal lobe-like), and
the Low-Level Executer (cerebellum-like) to enable long-term planning and
cumulative learning. The Lifelong Embodied Memory System, central to the
framework, alleviates inference speed issues in complex memory frameworks via
parallelized updates/retrieval across Spatial, Temporal, Episodic, and Semantic
submodules. It incorporates a dynamic Knowledge Graph (KG) and consistent
architectural design to enhance memory consistency and scalability. Evaluations
on EmbodiedBench show RoboMemory outperforms the open-source baseline
(Qwen2.5-VL-72B-Ins) by 25% in average success rate and surpasses the
closed-source State-of-the-Art (SOTA) (Claude3.5-Sonnet) by 5%, establishing
a new SOTA. Ablation studies validate key components (critic, spatial memory,
long-term memory), while real-world deployment confirms its lifelong learning
capability with significantly improved success rates across repeated tasks.
RoboMemory alleviates high latency challenges with scalability, serving as a
foundational reference for integrating multi-modal memory systems in physical
robots.
♻ ☆ Unified Linear Parametric Map Modeling and Perception-aware Trajectory Planning for Mobile Robotics
Autonomous navigation in mobile robots, reliant on perception and planning,
faces major hurdles in large-scale, complex environments. These include heavy
computational burdens for mapping, sensor occlusion failures for UAVs, and
traversal challenges on irregular terrain for UGVs, all compounded by a lack of
perception-aware strategies. To address these challenges, we introduce Random
Mapping and Random Projection (RMRP). This method constructs a lightweight
linear parametric map by first mapping data to a high-dimensional space,
followed by a sparse random projection for dimensionality reduction. Our novel
Residual Energy Preservation Theorem provides theoretical guarantees for this
process, ensuring critical geometric properties are preserved. Based on this
map, we propose the RPATR (Robust Perception-Aware Trajectory Planner)
framework. For UAVs, our method unifies grid and Euclidean Signed Distance
Field (ESDF) maps. The front-end uses an analytical occupancy gradient to
refine initial paths for safety and smoothness, while the back-end uses a
closed-form ESDF for trajectory optimization. Leveraging the trained RMRP
model's generalization, the planner predicts unobserved areas for proactive
navigation. For UGVs, the model characterizes terrain and provides closed-form
gradients, enabling online planning to circumvent large holes. Validated in
diverse scenarios, our framework demonstrates superior mapping performance in
time, memory, and accuracy, and enables computationally efficient, safe
navigation for high-speed UAVs and UGVs. The code will be released to foster
community collaboration.
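The two-stage feature pipeline described above (map to a high-dimensional space, then apply a sparse random projection) might be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the mapping, so random cosine features are assumed here, the projection uses the classic Achlioptas sparse scheme, and every function name and parameter is hypothetical rather than taken from the RMRP code.

```python
import math
import random

def make_random_mapping(in_dim, high_dim, rng, scale=1.0):
    """Return a function lifting a point to high_dim random cosine features.

    The cosine feature map is an assumed choice; the abstract only says the
    data is first mapped to a high-dimensional space.
    """
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)]
               for _ in range(high_dim)]
    biases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(high_dim)]

    def mapping(x):
        return [math.cos(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(weights, biases)]

    return mapping

def make_sparse_projection(high_dim, low_dim, rng):
    """Achlioptas-style matrix: entries +s, -s, 0 with prob. 1/6, 1/6, 2/3."""
    s = math.sqrt(3.0 / low_dim)  # scaling keeps squared norms unbiased

    def entry():
        u = rng.random()
        if u < 1.0 / 6.0:
            return s
        if u < 1.0 / 3.0:
            return -s
        return 0.0

    return [[entry() for _ in range(high_dim)] for _ in range(low_dim)]

def project(matrix, features):
    """Multiply the sparse matrix by the feature vector, skipping zeros."""
    return [sum(m * f for m, f in zip(row, features) if m != 0.0)
            for row in matrix]

def linear_map_value(theta, z):
    """Linear parametric map: the map value is a dot product with theta."""
    return sum(t * zi for t, zi in zip(theta, z))

rng = random.Random(0)
mapping = make_random_mapping(in_dim=3, high_dim=256, rng=rng)
projection = make_sparse_projection(high_dim=256, low_dim=32, rng=rng)
z = project(projection, mapping([0.5, -1.0, 2.0]))
```

In a sketch like this, the parameter vector `theta` would be fitted by ordinary linear regression on the projected features, which is what keeps the map lightweight.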
♻ ☆ $\textit{RoboTron-Nav}$: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction ICCV 2025
In language-guided visual navigation, agents locate target objects in unseen
environments using natural language instructions. For reliable navigation in
unfamiliar scenes, agents should possess strong perception, planning, and
prediction capabilities. Additionally, when agents revisit previously explored
areas during long-term navigation, they may retain irrelevant and redundant
historical perceptions, leading to suboptimal results. In this work, we propose
$\textbf{RoboTron-Nav}$, a unified framework that integrates perception,
planning, and prediction capabilities through multitask collaborations on
navigation and embodied question answering tasks, thereby enhancing navigation
performance. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history
sampling strategy to effectively and efficiently utilize historical
observations. By leveraging a large language model, RoboTron-Nav comprehends
diverse commands and complex visual scenes, resulting in appropriate navigation
actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation
on the $\mathrm{CHORES}$-$\mathbb{S}$ benchmark, setting a new state of the art.
Project page: https://yvfengzhong.github.io/RoboTron-Nav
comment: ICCV 2025
♻ ☆ Optimizing Mesh to Improve the Triangular Expansion Algorithm for Computing Visibility Regions
This paper addresses the problem of improving the query performance of the
triangular expansion algorithm (TEA) for computing visibility regions by
finding the most advantageous instance of the triangular mesh, the
preprocessing structure. The TEA recursively traverses the mesh while keeping
track of the visible region, the set of all points visible from a query point
in a polygonal world. We show that the measured query time is approximately
proportional to the number of triangle edge expansions during the mesh
traversal. We propose a new type of triangular mesh that minimizes the expected
number of expansions assuming the query points are drawn from a known
probability distribution. We design a heuristic method to approximate the mesh
and evaluate the approach on many challenging instances that resemble
real-world environments. The proposed mesh improves the mean query times by
12-16% compared to the reference constrained Delaunay triangulation. The
approach is suitable to boost offline applications that require computing
millions of queries without addressing the preprocessing time. The
implementation is publicly available to replicate our experiments and serve the
community.
comment: 30 pages, 43 figures (including subfigures), accepted version
♻ ☆ CARE: Enhancing Safety of Visual Navigation through Collision Avoidance via Repulsive Estimation
We propose CARE (Collision Avoidance via Repulsive Estimation) to improve the
robustness of learning-based visual navigation methods. Recently, visual
navigation models, particularly foundation models, have demonstrated promising
performance by generating viable trajectories using only RGB images. However,
these policies can generalize poorly to environments containing
out-of-distribution (OOD) scenes characterized by unseen objects or different
camera setups (e.g., variations in field of view, camera pose, or focal
length). Without fine-tuning, such models could produce trajectories that lead
to collisions, necessitating substantial efforts in data collection and
additional training. To address this limitation, we introduce CARE, an
attachable module that enhances the safety of visual navigation without
requiring additional range sensors or fine-tuning of pretrained models. CARE
can be integrated seamlessly into any RGB-based navigation model that generates
local robot trajectories. It dynamically adjusts trajectories produced by a
pretrained model using repulsive force vectors computed from depth images
estimated directly from RGB inputs. We evaluate CARE by integrating it with
state-of-the-art visual navigation models across diverse robot platforms.
Real-world experiments show that CARE significantly reduces collisions (up to
100%) without compromising navigation performance in goal-conditioned
navigation, and further improves collision-free travel distance (up to 10.7x)
in exploration tasks. Project page: https://airlab-sogang.github.io/CARE/
comment: 16 pages, 6 figures
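The idea of adjusting a predicted trajectory with repulsive force vectors can be illustrated with a classic artificial potential field. This is a hedged sketch, not CARE's actual formulation: it assumes obstacle points have already been recovered from the estimated depth image as 2D coordinates in the robot frame, and the function names, gains, and influence radius are all hypothetical.

```python
import math

def repulsive_vector(point, obstacles, influence=1.0, gain=0.05):
    """Sum potential-field repulsion on `point` from nearby obstacle points.

    Obstacles farther than `influence` metres contribute nothing.
    """
    fx = fy = 0.0
    for ox, oy in obstacles:
        dx, dy = point[0] - ox, point[1] - oy
        d = math.hypot(dx, dy)
        if d < 1e-6 or d >= influence:
            continue
        # Magnitude grows sharply as the point approaches the obstacle.
        mag = gain * (1.0 / d - 1.0 / influence) / (d * d)
        fx += mag * dx / d
        fy += mag * dy / d
    return fx, fy

def adjust_trajectory(waypoints, obstacles, max_shift=0.3, **field_params):
    """Shift each waypoint along its repulsive vector, capped at max_shift."""
    adjusted = []
    for p in waypoints:
        fx, fy = repulsive_vector(p, obstacles, **field_params)
        norm = math.hypot(fx, fy)
        if norm > max_shift:
            fx, fy = fx * max_shift / norm, fy * max_shift / norm
        adjusted.append((p[0] + fx, p[1] + fy))
    return adjusted

# Waypoints straight ahead, one obstacle slightly to the right of the path.
path = [(0.5, 0.0), (1.0, 0.0), (1.5, 0.0)]
safer_path = adjust_trajectory(path, obstacles=[(1.0, -0.2)])
```

Capping the shift is a design choice made here so that the adjustment stays a local correction and the pretrained model's trajectory still determines the overall direction.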
♻ ☆ Position-Based Flocking for Robust Alignment
This paper presents a position-based flocking model for interacting agents,
balancing cohesion-separation and alignment to achieve stable collective
motion. The model modifies a position-velocity-based approach by approximating
velocity differences using initial and current positions, introducing a
threshold weight to ensure sustained alignment. Simulations with 50 agents in
2D demonstrate that the position-based model produces stronger alignment and
more rigid and compact formations compared to the position-velocity-based
model. The alignment metric and separation distances highlight the efficacy of
the proposed model in achieving robust flocking behavior. The model's use of
positions ensures robust alignment, with applications in robotics and
collective dynamics.
comment: Accepted for "The 9th International Symposium on Swarm Behavior and
Bio-Inspired Robotics 2025" Simulation video for Fig. 1 at
https://youtu.be/yID0taa7X7o Simulation video for Fig. 3 at
https://youtu.be/8I_yBhY2imo Simulation video for Fig. 5 at
https://youtu.be/fTG7pgBPMZg
♻ ☆ Extracting Visual Plans from Unlabeled Videos via Symbolic Guidance
Visual planning, by offering a sequence of intermediate visual subgoals to a
goal-conditioned low-level policy, achieves promising performance on
long-horizon manipulation tasks. To obtain the subgoals, existing methods
typically resort to video generation models but suffer from model hallucination
and computational cost. We present Vis2Plan, an efficient, explainable and
white-box visual planning framework powered by symbolic guidance. From raw,
unlabeled play data, Vis2Plan harnesses vision foundation models to
automatically extract a compact set of task symbols, which allows building a
high-level symbolic transition graph for multi-goal, multi-stage planning. At
test time, given a desired task goal, our planner conducts planning at the
symbolic level and assembles a sequence of physically consistent intermediate
sub-goal images grounded by the underlying symbolic representation. Our
Vis2Plan outperforms strong diffusion video generation-based visual planners by
delivering a 53% higher aggregate success rate in real robot settings while
generating visual plans 35$\times$ faster. The results indicate that Vis2Plan
is able to generate physically consistent image goals while offering fully
inspectable reasoning steps.
♻ ☆ RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving ICCV 2025
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension
and interpretation capabilities in Autonomous Driving (AD) by incorporating
large language models. Despite the advancements, current data-driven AD
approaches tend to concentrate on a single dataset and specific tasks,
neglecting their overall capabilities and ability to generalize. To bridge
these gaps, we propose RoboTron-Drive, a general large multimodal model
designed to process diverse data inputs, such as images and multi-view videos,
while performing a broad spectrum of AD tasks, including perception,
prediction, and planning. Initially, the model undergoes curriculum
pre-training to process varied visual signals and perform basic visual
comprehension and perception tasks. Subsequently, we augment and standardize
various AD datasets to finetune the model, resulting in an all-in-one LMM for
autonomous driving. To assess the general capabilities and generalization
ability, we conduct evaluations on six public benchmarks and undertake
zero-shot transfer on three unseen datasets, where RoboTron-Drive achieves
state-of-the-art performance across all tasks. We hope RoboTron-Drive can serve as a
promising solution for AD in the real world. Project page with code:
https://github.com/zhijian11/RoboTron-Drive.
comment: ICCV 2025
♻ ☆ APEX-MR: Multi-Robot Asynchronous Planning and Execution for Cooperative Assembly
Compared to a single-robot workstation, a multi-robot system offers several
advantages: 1) it expands the system's workspace, 2) improves task efficiency,
and, more importantly, 3) enables robots to achieve significantly more complex
and dexterous tasks, such as cooperative assembly. However, coordinating the
tasks and motions of multiple robots is challenging due to issues such as system
uncertainty, task efficiency, algorithm scalability, and safety concerns. To
address these challenges, this paper studies multi-robot coordination and
proposes APEX-MR, an asynchronous planning and execution framework designed to
safely and efficiently coordinate multiple robots to achieve cooperative
assembly, e.g., LEGO assembly. In particular, APEX-MR provides a systematic
approach to post-process multi-robot tasks and motion plans to enable robust
asynchronous execution under uncertainty. Experimental results demonstrate that
APEX-MR can significantly speed up the execution of many long-horizon LEGO
assembly tasks, by 48% compared to sequential planning and 36% compared to
synchronous planning on average. To further demonstrate performance, we deploy
APEX-MR in a dual-arm system to perform physical LEGO assembly. To our
knowledge, this is the first robotic system capable of performing customized
LEGO assembly using commercial LEGO bricks. The experimental results
demonstrate that the dual-arm system, with APEX-MR, can safely coordinate robot
motions, efficiently collaborate, and construct complex LEGO structures. Our
project website is available at
https://intelligent-control-lab.github.io/APEX-MR/.
comment: 17 pages, 11 figures. To appear in the proceedings of RSS 2025
♻ ☆ Task-driven SLAM Benchmarking For Robot Navigation IROS 2025
A critical use case of SLAM for mobile assistive robots is to support
localization during a navigation-based task. Current SLAM benchmarks overlook
the significance of repeatability (precision), despite its importance in
real-world deployments. To address this gap, we propose a task-driven approach
to SLAM benchmarking, TaskSLAM-Bench. It employs precision as a key metric,
accounts for SLAM's mapping capabilities, and has easy-to-meet implementation
requirements. Simulated and real-world testing scenarios of SLAM methods
provide insights into the navigation performance properties of modern visual
and LiDAR SLAM solutions. The outcomes show that passive stereo SLAM operates
at a level of precision comparable to LiDAR SLAM in typical indoor
environments. TaskSLAM-Bench complements existing benchmarks and offers richer
assessment of SLAM performance in navigation-focused scenarios. Publicly
available code permits in-situ SLAM testing in custom environments with
properly equipped robots.
comment: 7 pages, 8 figures, 1 table. Accepted to IROS 2025
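The distinction between precision (repeatability) and accuracy can be made concrete with a small calculation. The abstract does not define the metrics, so the following is one common reading offered as an assumption: precision is the RMS spread of repeated arrival positions around their own mean, and accuracy is the offset of that mean from the intended waypoint. The function name is hypothetical.

```python
import math

def precision_and_accuracy(arrivals, target):
    """Return (precision, accuracy) for repeated 2D arrivals at one waypoint.

    precision: RMS distance of arrivals from their mean (lower = more repeatable)
    accuracy:  distance from the mean arrival to the intended target
    """
    n = len(arrivals)
    mean_x = sum(x for x, _ in arrivals) / n
    mean_y = sum(y for _, y in arrivals) / n
    spread = sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in arrivals)
    precision = math.sqrt(spread / n)
    accuracy = math.hypot(mean_x - target[0], mean_y - target[1])
    return precision, accuracy

# A robot that always stops at the same, slightly wrong spot:
# highly repeatable (spread near zero) but biased by about 0.1 m.
runs = [(1.1, 2.0), (1.1, 2.0), (1.1, 2.0)]
precision, accuracy = precision_and_accuracy(runs, target=(1.0, 2.0))
```

Under this reading, a consistent bias matters less for a navigation task than a large spread, since a repeatable offset can be compensated once while an unrepeatable one cannot.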
♻ ☆ Tunable Leg Stiffness in a Monopedal Hopper for Energy-Efficient Vertical Hopping Across Varying Ground Profiles ICRA
We present the design and implementation of HASTA (Hopper with Adjustable
Stiffness for Terrain Adaptation), a vertical hopping robot with real-time
tunable leg stiffness, aimed at optimizing energy efficiency across various
ground profiles (a pair of ground stiffness and damping conditions). By
adjusting leg stiffness, we aim to maximize apex hopping height, a key metric
for energy-efficient vertical hopping. We hypothesize that softer legs perform
better on soft, damped ground by minimizing penetration and energy loss, while
stiffer legs excel on hard, less damped ground by reducing limb deformation and
energy dissipation. Through experimental tests and simulations, we find the
best leg stiffness within our selection for each combination of ground
stiffness and damping, enabling the robot to achieve maximum steady-state
hopping height with a constant energy input. These results support our
hypothesis that tunable stiffness improves energy-efficient locomotion in
controlled experimental conditions. In addition, the simulation provides
insights that could aid in the future development of controllers for selecting
leg stiffness.
comment: 2025 IEEE International Conference on Robotics & Automation (ICRA)
♻ ☆ Following Is All You Need: Robot Crowd Navigation Using People As Planners
Navigating in crowded environments requires the robot to be equipped with
high-level reasoning and planning techniques. Existing works focus on
developing complex and heavyweight planners while ignoring the role of human
intelligence. Since humans are highly capable agents who are also widely
available in a crowd navigation setting, we propose an alternative scheme where
the robot utilises people as planners to benefit from their effective planning
decisions and social behaviours. Through a set of rule-based evaluations, we
identify suitable human leaders who exhibit the potential to guide the robot
towards its goal. Using a simple base planner, the robot follows the selected
leader through short-horizon subgoals that are designed to be straightforward to
achieve. We demonstrate through both simulated and real-world experiments that
our novel framework generates safe and efficient robot plans compared to
existing planners, even without predictive or data-driven modules. Our method
also brings human-like robot behaviours without explicitly defining traffic
rules and social norms. Code will be available at
https://github.com/centiLinda/PeopleAsPlanner.git.
comment: RAL 2025
♻ ☆ Multi-Fidelity Reinforcement Learning for Time-Optimal Quadrotor Re-planning
High-speed online trajectory planning for UAVs poses a significant challenge
due to the need for precise modeling of complex dynamics while also being
constrained by computational limitations. This paper presents a multi-fidelity
reinforcement learning method (MFRL) that aims to effectively create a
realistic dynamics model and simultaneously train a planning policy that can be
readily deployed in real-time applications. The proposed method involves the
co-training of a planning policy and a reward estimator; the latter predicts
the performance of the policy's output and is trained efficiently through
multi-fidelity Bayesian optimization. This optimization approach models the
correlation between different fidelity levels, thereby constructing a
high-fidelity model based on a low-fidelity foundation, which enables the
accurate development of the reward model with limited high-fidelity
experiments. The framework is further extended to include real-world flight
experiments in reinforcement learning training, allowing the reward model to
precisely reflect real-world constraints and broadening the policy's
applicability to real-world scenarios. We present rigorous evaluations by
training and testing the planning policy in both simulated and real-world
environments. The resulting trained policy not only generates faster and more
reliable trajectories compared to the baseline snap minimization method, but it
also achieves trajectory updates in 2 ms on average, while the baseline method
takes several minutes.
comment: Accepted for publication in the International Journal of Robotics
Research
♻ ☆ BUFFER-X: Towards Zero-Shot Point Cloud Registration in Diverse Scenes ICCV 2025
Recent advances in deep learning-based point cloud registration have improved
generalization, yet most methods still require retraining or manual parameter
tuning for each new environment. In this paper, we identify three key factors
limiting generalization: (a) reliance on environment-specific voxel size and
search radius, (b) poor out-of-domain robustness of learning-based keypoint
detectors, and (c) raw coordinate usage, which exacerbates scale discrepancies.
To address these issues, we present a zero-shot registration pipeline called
BUFFER-X by (a) adaptively determining voxel size/search radii, (b) using
farthest point sampling to bypass learned detectors, and (c) leveraging
patch-wise scale normalization for consistent coordinate bounds. In particular,
we present a multi-scale patch-based descriptor generation and a hierarchical
inlier search across scales to improve robustness in diverse scenes. We also
propose a novel generalizability benchmark using 11 datasets that cover various
indoor/outdoor scenarios and sensor modalities, demonstrating that BUFFER-X
achieves substantial generalization without prior information or manual
parameter tuning for the test datasets. Our code is available at
https://github.com/MIT-SPARK/BUFFER-X.
comment: 20 pages, 14 figures. Accepted as a highlight paper at ICCV 2025
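Farthest point sampling, which the abstract names as the replacement for learned keypoint detectors, is a standard greedy procedure and can be sketched in a few lines. This is a generic textbook version written for illustration, not code from the BUFFER-X repository; the function name and the choice of starting index are assumptions.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Greedily pick k point indices, each farthest from those already picked."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest selected point so far.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(idx)
        for i, p in enumerate(points):
            d = math.dist(p, points[idx])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0),
         (10.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
keypoint_indices = farthest_point_sampling(cloud, k=3)  # [0, 3, 4]
```

Because the selection depends only on geometry, it needs no training data, which is consistent with the abstract's motivation of avoiding detectors that degrade out of domain.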