Robotics 47
☆ Interactive Post-Training for Vision-Language-Action Models
We introduce RIPT-VLA, a simple and scalable reinforcement-learning-based
interactive post-training paradigm that fine-tunes pretrained
Vision-Language-Action (VLA) models using only sparse binary success rewards.
Existing VLA training pipelines rely heavily on offline expert demonstration
data and supervised imitation, limiting their ability to adapt to new tasks and
environments under low-data regimes. RIPT-VLA addresses this by enabling
interactive post-training with a stable policy optimization algorithm based on
dynamic rollout sampling and leave-one-out advantage estimation.
RIPT-VLA has the following characteristics. First, it applies to various VLA
models, improving the lightweight QueST model by 21.2% and raising the 7B
OpenVLA-OFT model to an unprecedented 97.5% success rate. Second, it
is computationally efficient and data-efficient: with only one demonstration,
RIPT-VLA enables an unworkable SFT model (4%) to succeed with a 97% success
rate within 15 iterations. Furthermore, we demonstrate that the policy learned
by RIPT-VLA generalizes across different tasks and scenarios and is robust to
the initial state context. These results highlight RIPT-VLA as a practical and
effective paradigm for post-training VLA models through minimal supervision.
comment: Project page: https://ariostgx.github.io/ript_vla/
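The leave-one-out advantage estimation named in the RIPT-VLA abstract can be sketched for sparse binary success rewards as follows. This is a minimal illustration, not the authors' implementation: the function names `leave_one_out_advantages` and `keep_group` are hypothetical, and the rule of discarding rollout groups whose rewards are all identical is an assumption about what "dynamic rollout sampling" might involve.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each rollout: its reward minus the mean reward of the others."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per context")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def keep_group(rewards: List[float]) -> bool:
    """Hypothetical filter: identical rewards give all-zero advantages (no signal)."""
    return len(set(rewards)) > 1


# Four rollouts from the same initial state, with binary success rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = leave_one_out_advantages(rewards)
```

Successful rollouts receive a positive advantage and failed ones a negative advantage, with no learned value function involved. A group in which every rollout succeeds or every rollout fails yields zero advantage for all of its members.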
☆ CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning
Jiange Yang, Yansong Shi, Haoyi Zhu, Mingyu Liu, Kaijing Ma, Yating Wang, Gangshan Wu, Tong He, Limin Wang
Learning latent motion from Internet videos is crucial for building
generalist robots. However, existing discrete latent action methods suffer from
information loss and struggle with complex and fine-grained dynamics. We
propose CoMo, which aims to learn more informative continuous motion
representations from diverse, internet-scale videos. CoMo employs an early
temporal feature difference mechanism to prevent model collapse and suppress
static appearance noise, effectively discouraging shortcut learning.
Furthermore, guided by the information bottleneck principle, we constrain the
latent motion embedding dimensionality to achieve a better balance between
retaining sufficient action-relevant information and minimizing the inclusion
of action-irrelevant appearance noise. We also introduce two new
metrics for evaluating motion more robustly and affordably and for guiding the
development of motion learning methods: (i) the linear probing MSE of action prediction,
and (ii) the cosine similarity between past-to-current and future-to-current
motion embeddings. Critically, CoMo exhibits strong zero-shot generalization,
enabling it to generate continuous pseudo actions for previously unseen video
domains. This capability facilitates unified policy joint learning using pseudo
actions derived from various action-less video datasets (such as
cross-embodiment videos and, notably, human demonstration videos), potentially
augmented with limited labeled robot data. Extensive experiments show that
policies co-trained with CoMo pseudo actions achieve superior performance with
both diffusion and autoregressive architectures in simulated and real-world
settings.
comment: 18 pages, 7 figures
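The second CoMo metric, the cosine similarity between past-to-current and future-to-current motion embeddings, reduces to a standard cosine similarity once the two embeddings are available. The sketch below uses made-up three-dimensional vectors. The variable names are hypothetical, and how the score should be interpreted is left to the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Illustrative embeddings only; real ones would come from the motion encoder.
past_to_current = [0.5, -1.0, 0.25]
future_to_current = [-0.5, 1.0, -0.25]
score = cosine_similarity(past_to_current, future_to_current)
```

In this toy case the two embeddings point in exactly opposite directions, so the score sits at the lower end of the [-1, 1] range.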
☆ Extremely Simple Multimodal Outlier Synthesis for Out-of-Distribution Detection and Segmentation
Out-of-distribution (OOD) detection and segmentation are crucial for
deploying machine learning models in safety-critical applications such as
autonomous driving and robot-assisted surgery. While prior research has
primarily focused on unimodal image data, real-world applications are
inherently multimodal, requiring the integration of multiple modalities for
improved OOD detection. A key challenge is the lack of supervision signals from
unknown data, leading to overconfident predictions on OOD samples. To address
this challenge, we propose Feature Mixing, an extremely simple and fast method
for multimodal outlier synthesis with theoretical support, which can be further
optimized to help the model better distinguish between in-distribution (ID) and
OOD data. Feature Mixing is modality-agnostic and applicable to various
modality combinations. Additionally, we introduce CARLA-OOD, a novel multimodal
dataset for OOD segmentation, featuring synthetic OOD objects across diverse
scenes and weather conditions. Extensive experiments on SemanticKITTI,
nuScenes, CARLA-OOD datasets, and the MultiOOD benchmark demonstrate that
Feature Mixing achieves state-of-the-art performance with a $10\times$ to
$370\times$ speedup. Our source code and dataset will be available at
https://github.com/mona4399/FeatureMixing.
☆ 3D Equivariant Visuomotor Policy Learning via Spherical Projection
Equivariant models have recently been shown to improve the data efficiency of
diffusion policy by a significant margin. However, prior work that explored
this direction focused primarily on point cloud inputs generated by multiple
cameras fixed in the workspace. This type of point cloud input is not
compatible with the now-common setting where the primary input modality is an
eye-in-hand RGB camera like a GoPro. This paper closes this gap by
incorporating into the diffusion policy model a process that projects features
from the 2D RGB camera image onto a sphere. This enables us to reason about
symmetries in SO(3) without explicitly reconstructing a point cloud. We perform
extensive experiments in both simulation and the real world that demonstrate
that our method consistently outperforms strong baselines in terms of both
performance and sample efficiency. Our work is the first SO(3)-equivariant
policy learning framework for robotic manipulation that works using only
monocular RGB inputs.
☆ Beyond Needle(s) in the Embodied Haystack: Environment, Architecture, and Training Considerations for Long Context Reasoning
We introduce $\infty$-THOR, a new framework for long-horizon embodied tasks
that advances long-context understanding in embodied AI. $\infty$-THOR
provides: (1) a generation framework for synthesizing scalable, reproducible,
and unlimited long-horizon trajectories; (2) a novel embodied QA task,
Needle(s) in the Embodied Haystack, where multiple scattered clues across
extended trajectories test agents' long-context reasoning ability; and (3) a
long-horizon dataset and benchmark suite featuring complex tasks that span
hundreds of environment steps, each paired with ground-truth action sequences.
To enable this capability, we explore architectural adaptations, including
interleaved Goal-State-Action modeling, context extension techniques, and
Context Parallelism, to equip LLM-based agents for extreme long-context
reasoning and interaction. Experimental results and analyses highlight the
challenges posed by our benchmark and provide insights into training strategies
and model behaviors under long-horizon conditions. Our work provides a
foundation for the next generation of embodied AI systems capable of robust,
long-term reasoning and planning.
☆ UAV See, UGV Do: Aerial Imagery and Virtual Teach Enabling Zero-Shot Ground Vehicle Repeat IROS 2025
This paper presents Virtual Teach and Repeat (VirT&R): an extension of the
Teach and Repeat (T&R) framework that enables GPS-denied, zero-shot autonomous
ground vehicle navigation in untraversed environments. VirT&R leverages aerial
imagery captured for a target environment to train a Neural Radiance Field
(NeRF) model so that dense point clouds and photo-textured meshes can be
extracted. The NeRF mesh is used to create a high-fidelity simulation of the
environment for piloting an unmanned ground vehicle (UGV) to virtually define a
desired path. The mission can then be executed in the actual target environment
by using NeRF-derived point cloud submaps associated along the path and an
existing LiDAR Teach and Repeat (LT&R) framework. We benchmark the
repeatability of VirT&R on over 12 km of autonomous driving data using physical
markings that allow a sim-to-real lateral path-tracking error to be obtained
and compared with LT&R. VirT&R achieved measured root mean squared errors
(RMSE) of 19.5 cm and 18.4 cm in two different environments, which are slightly
less than one tire width (24 cm) on the robot used for testing, and respective
maximum errors were 39.4 cm and 47.6 cm. This was done using only the
NeRF-derived teach map, demonstrating that VirT&R has similar closed-loop
path-tracking performance to LT&R but does not require a human to manually
teach the path to the UGV in the actual environment.
comment: 8 pages, 7 figures, submitted to IROS 2025
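The two statistics reported for VirT&R, the root mean squared error and the maximum error of the lateral path-tracking offset, can be computed as sketched below. The error values are made up for illustration and are not taken from the paper. The names `rmse` and `lateral_errors_cm` are hypothetical.

```python
import math
from typing import Sequence


def rmse(errors: Sequence[float]) -> float:
    """Root mean squared error of signed lateral offsets."""
    if not errors:
        raise ValueError("need at least one error sample")
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Made-up signed lateral offsets (cm) measured at physical markings along a path.
lateral_errors_cm = [10.0, -20.0, 30.0, -15.0]

rmse_cm = rmse(lateral_errors_cm)
max_abs_cm = max(abs(e) for e in lateral_errors_cm)
tire_width_cm = 24.0  # tire width quoted in the abstract
within_one_tire_width = rmse_cm < tire_width_cm
```

Squaring removes the sign of each offset, so deviations to the left and right of the taught path contribute equally to the RMSE.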
☆ RealEngine: Simulating Autonomous Driving in Realistic Context
Driving simulation plays a crucial role in developing reliable driving agents
by providing controlled, evaluative environments. To enable meaningful
assessments, a high-quality driving simulator must satisfy several key
requirements: multi-modal sensing capabilities (e.g., camera and LiDAR) with
realistic scene rendering to minimize observational discrepancies; closed-loop
evaluation to support free-form trajectory behaviors; highly diverse traffic
scenarios for thorough evaluation; multi-agent cooperation to capture
interaction dynamics; and high computational efficiency to ensure affordability
and scalability. However, existing simulators and benchmarks fail to
comprehensively meet these fundamental criteria. To bridge this gap, this paper
introduces RealEngine, a novel driving simulation framework that holistically
integrates 3D scene reconstruction and novel view synthesis techniques to
achieve realistic and flexible closed-loop simulation in the driving context.
By leveraging real-world multi-modal sensor data, RealEngine reconstructs
background scenes and foreground traffic participants separately, allowing for
highly diverse and realistic traffic scenarios through flexible scene
composition. This synergistic fusion of scene reconstruction and view synthesis
enables photorealistic rendering across multiple sensor modalities, ensuring
both perceptual fidelity and geometric accuracy. Building upon this
environment, RealEngine supports three essential driving simulation categories:
non-reactive simulation, safety testing, and multi-agent interaction,
collectively forming a reliable and comprehensive benchmark for evaluating the
real-world performance of driving agents.
☆ FlashBack: Consistency Model-Accelerated Shared Autonomy
Shared autonomy is an enabling technology that provides users with control
authority over robots that would otherwise be difficult if not impossible to
directly control. Yet, standard methods make assumptions that limit their
adoption in practice, such as prior knowledge of the user's goals or the
objective (i.e., reward) function that they wish to optimize, knowledge of the
user's policy, or query-level access to the user during training.
Diffusion-based approaches to shared autonomy do not make such assumptions and
instead only require access to demonstrations of desired behaviors, while
allowing the user to maintain control authority. However, these advantages have
come at the expense of high computational complexity, which has made real-time
shared autonomy all but impossible. To overcome this limitation, we propose
Consistency Shared Autonomy (CSA), a shared autonomy framework that employs a
consistency model-based formulation of diffusion. Key to CSA is that it employs
the distilled probability flow ordinary differential equation (PF ODE) to
generate high-fidelity samples in a single step. This results in inference
speeds significantly higher than what is possible with previous diffusion-based
approaches to shared autonomy, enabling real-time assistance in complex domains
with only a single function evaluation. Further, by intervening on flawed
actions at intermediate states of the PF ODE, CSA enables varying levels of
assistance. We evaluate CSA on a variety of challenging simulated and
real-world robot control problems, demonstrating significant improvements over
state-of-the-art methods both in terms of task performance and computational
efficiency.
☆ Efficient Online RL Fine Tuning with Offline Pre-trained Policy Only
Improving the performance of pre-trained policies through online
reinforcement learning (RL) is a critical yet challenging topic. Existing
online RL fine-tuning methods require continued training with offline
pretrained Q-functions for stability and performance. However, these offline
pretrained Q-functions commonly underestimate state-action pairs beyond the
offline dataset due to the conservatism in most offline RL methods, which
hinders further exploration when transitioning from the offline to the online
setting. Additionally, this requirement limits their applicability in scenarios
where only pre-trained policies are available but pre-trained Q-functions are
absent, such as in imitation learning (IL) pre-training. To address these
challenges, we propose a method for efficient online RL fine-tuning using
solely the offline pre-trained policy, eliminating reliance on pre-trained
Q-functions. We introduce PORL (Policy-Only Reinforcement Learning
Fine-Tuning), which rapidly initializes the Q-function from scratch during the
online phase to avoid detrimental pessimism. Our method not only achieves
competitive performance with advanced offline-to-online RL algorithms and
online RL approaches that leverage data or policy priors, but also pioneers a
new path for directly fine-tuning behavior cloning (BC) policies.
☆ Perceptual Quality Assessment for Embodied AI
Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai
Embodied AI has developed rapidly in recent years, but it is still mainly
deployed in laboratories, with various distortions in the real world limiting
its application. Traditionally, Image Quality Assessment (IQA) methods are
applied to predict human preferences for distorted images; however, there is no
IQA method to assess the usability of an image in embodied tasks, namely, the
perceptual quality for robots. To provide accurate and reliable quality
indicators for future embodied scenarios, we first propose the topic: IQA for
Embodied AI. Specifically, we (1) constructed a
perception-cognition-decision-execution pipeline based on the Mertonian system
and meta-cognitive theory, and defined a comprehensive subjective score
collection process; (2) established the Embodied-IQA database, containing over
36k reference/distorted image pairs, with more than 5 million fine-grained
annotations provided by Vision Language Models, Vision Language Action models,
and real-world robots; (3) trained
and validated the performance of mainstream IQA methods on Embodied-IQA,
demonstrating the need to develop more accurate quality indicators for Embodied
AI. We sincerely hope that through evaluation, we can promote the application
of Embodied AI under complex distortions in the real world. Project page:
https://github.com/lcysyzxdxc/EmbodiedIQA
☆ D-LIO: 6DoF Direct LiDAR-Inertial Odometry based on Simultaneous Truncated Distance Field Mapping
This paper presents a new approach for 6DoF Direct LiDAR-Inertial Odometry
(D-LIO) based on the simultaneous mapping of truncated distance fields on CPU.
Such a continuous representation (in the vicinity of the points) enables working
with raw 3D LiDAR data online, avoiding the need for LiDAR feature selection and
tracking, simplifying the odometry pipeline and easily generalizing to many
scenarios. The method is based on the proposed Fast Truncated Distance Field
(Fast-TDF) method as a convenient tool to represent the environment. This
representation enables i) solving the LiDAR point-cloud registration as a
nonlinear optimization process without the need to select or track LiDAR
features in the input data, ii) simultaneously producing an accurate truncated
distance field map of the environment, and iii) updating this map in constant
time regardless of its size. The approach is tested using open datasets,
both aerial and ground. It is also benchmarked against other state-of-the-art
odometry approaches, demonstrating the same or better level of accuracy with
the added value of an online-generated TDF representation of the environment
that can be used for other robotics tasks such as planning or collision avoidance.
The source code is publicly available at
https://anonymous.4open.science/r/D-LIO
comment: 9 pages, 4 figures and 43 references
☆ MEbots: Integrating a RISC-V Virtual Platform with a Robotic Simulator for Energy-aware Design
Giovanni Pollo, Mohamed Amine Hamdi, Matteo Risso, Lorenzo Ruotolo, Pietro Furbatto, Matteo Isoldi, Yukai Chen, Alessio Burrello, Enrico Macii, Massimo Poncino, Daniele Jahier Pagliari, Sara Vinco
Virtual Platforms (VPs) enable early software validation of autonomous
systems' electronics, reducing costs and time-to-market. While many VPs support
both functional and non-functional simulation (e.g., timing, power), they lack
the capability of simulating the environment in which the system operates. In
contrast, robotics simulators lack accurate timing and power features. This
twofold shortcoming limits the effectiveness of the design flow, as the
designer cannot fully evaluate the features of the solution under development.
This paper presents a novel, fully open-source framework bridging this gap by
integrating a robotics simulator (Webots) with a VP for RISC-V-based systems
(MESSY). The framework enables a holistic, mission-level, energy-aware
co-simulation of electronics in their surrounding environment, streamlining the
exploration of design configurations and advanced power management policies.
☆ Joint Magnetometer-IMU Calibration via Maximum A Posteriori Estimation
This paper presents a new approach for jointly calibrating magnetometers and
inertial measurement units, focusing on improving calibration accuracy and
computational efficiency. The proposed method formulates the calibration
problem as a maximum a posteriori estimation problem, treating both the
calibration parameters and orientation trajectory of the sensors as unknowns.
This formulation enables efficient optimization with closed-form derivatives.
The method is compared against two state-of-the-art approaches in terms of
computational complexity and estimation accuracy. Simulation results
demonstrate that the proposed method achieves lower root mean square error in
calibration parameters while maintaining competitive computational efficiency.
Further validation through real-world experiments confirms the practical
benefits of our approach: it effectively reduces position drift in a magnetic
field-aided inertial navigation system by more than a factor of two on most
datasets. Moreover, the proposed method calibrated 30 magnetometers in less
than 2 minutes. The contributions include a new calibration method, an analysis
of existing methods, and a comprehensive empirical evaluation. Datasets and
algorithms are made publicly available to promote reproducible research.
☆ Monitoring Electrostatic Adhesion Forces via Acoustic Pressure
Electrostatic adhesion (EA) is widely used in mobile robotics, haptics, and
robotic end effectors for its adaptability to diverse substrates and low energy
consumption. Force sensing is important for feedback control, interaction, and
monitoring in EA systems. However, EA force monitoring often relies on bulky
and expensive sensors, increasing the complexity and weight of the entire
system. This paper presents an acoustic-pressure-based method to monitor EA
forces without contacting the adhesion pad. When the EA pad is driven by a
bipolar square-wave voltage to adhere a conductive object, periodic acoustic
pulses arise from the EA system. We employed a microphone to capture these
acoustic pressure signals and investigated the factors influencing peak pressure
values. Results show that the peak value of acoustic pressure increased with
the mass and contact area of the adhered object, as well as with the amplitude
and frequency of the driving voltage. We applied this technique to mass
estimation of various objects and simultaneous monitoring of two EA systems.
Then, we integrated this technique into an EA end effector that enables
monitoring the change of adhered object mass during transport. The proposed
technique offers a low-cost, non-contact, and multi-object monitoring solution
for EA end effectors in handling tasks.
comment: 6 pages, 7 figures
☆ Safe Uncertainty-Aware Learning of Robotic Suturing
Robot-Assisted Minimally Invasive Surgery is currently fully manually
controlled by a trained surgeon. Automating this has great potential for
alleviating issues, e.g., physical strain, highly repetitive tasks, and
shortages of trained surgeons. For these reasons, recent works have utilized
Artificial Intelligence methods, which show promising adaptability. Despite
these advances, there is skepticism of these methods because they lack
explainability and robust safety guarantees. This paper presents a framework
for a safe, uncertainty-aware learning method. We train an Ensemble Model of
Diffusion Policies using expert demonstrations of needle insertion. Using an
Ensemble model, we can quantify the policy's epistemic uncertainty, which is
used to determine Out-Of-Distribution scenarios. This allows the system to
release control back to the surgeon in the event of an unsafe scenario.
Additionally, we implement a model-free Control Barrier Function to place
formal safety guarantees on the predicted action. We experimentally evaluate
our proposed framework using a state-of-the-art robotic suturing simulator. We
evaluate multiple scenarios, such as dropping the needle, moving the camera,
and moving the phantom. The learned policy is robust to these perturbations,
showing corrective behaviors and generalization, and it is possible to detect
Out-Of-Distribution scenarios. We further demonstrate that the Control Barrier
Function successfully limits the action to remain within our specified safety
set in the case of unsafe predictions.
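The idea of using ensemble disagreement as a proxy for epistemic uncertainty, and handing control back to the surgeon when it is too high, can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's method: each ensemble member is reduced to a single predicted action vector, disagreement is taken as the mean per-dimension variance, and the threshold value and all names are hypothetical. The Control Barrier Function component is not covered here.

```python
from statistics import mean, pvariance
from typing import Sequence


def ensemble_disagreement(predictions: Sequence[Sequence[float]]) -> float:
    """Mean per-dimension population variance across ensemble members."""
    if len(predictions) < 2:
        raise ValueError("need at least two ensemble members")
    per_dimension = list(zip(*predictions))  # group values by action dimension
    return mean(pvariance(values) for values in per_dimension)


def should_hand_over(predictions: Sequence[Sequence[float]], threshold: float) -> bool:
    """Flag a scenario as out-of-distribution when disagreement exceeds the threshold."""
    return ensemble_disagreement(predictions) > threshold


# Made-up 2-D actions predicted by three ensemble members.
in_distribution = [[0.10, 0.20], [0.11, 0.19], [0.09, 0.21]]
out_of_distribution = [[0.10, 0.20], [0.90, -0.40], [-0.50, 0.80]]
THRESHOLD = 0.01  # hypothetical; would be calibrated on held-out demonstrations
```

Members trained on the same demonstrations tend to agree on familiar states and diverge on unfamiliar ones, which is why their spread can serve as an out-of-distribution signal.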
☆ Find the Fruit: Designing a Zero-Shot Sim2Real Deep RL Planner for Occlusion Aware Plant Manipulation
This paper presents an end-to-end deep reinforcement learning (RL) framework
for occlusion-aware robotic manipulation in cluttered plant environments. Our
approach enables a robot to interact with a deformable plant to reveal hidden
objects of interest, such as fruits, using multimodal observations. We decouple
the kinematic planning problem from robot control to simplify zero-shot
sim2real transfer for the trained policy. Our results demonstrate that the
trained policy, deployed using our framework, achieves up to 86.7% success in
real-world trials across diverse initial conditions. Our findings pave the way
toward autonomous, perception-driven agricultural robots that intelligently
interact with complex foliage plants to "find the fruit" in challenging
occluded scenarios, without the need for explicitly designed geometric and
dynamic models of every plant scenario.
comment: 18 Pages, 15 Figures, 5 Tables
☆ ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models
Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, Zhongzhi Li, Rui Yan, Xiuying Chen
Large Vision-Language Models (LVLMs) have recently advanced robotic
manipulation by leveraging vision for scene perception and language for
instruction following. However, existing methods rely heavily on costly
human-annotated training datasets, which limits their generalization and causes
them to struggle in out-of-domain (OOD) scenarios, reducing real-world
adaptability. To address these challenges, we propose ManipLVM-R1, a novel
reinforcement learning framework that replaces traditional supervision with
Reinforcement Learning using Verifiable Rewards (RLVR). By directly optimizing
for task-aligned outcomes, our method enhances generalization and physical
reasoning while removing the dependence on costly annotations. Specifically, we
design two rule-based reward functions targeting key robotic manipulation
subtasks: an Affordance Perception Reward to enhance localization of
interaction regions, and a Trajectory Match Reward to ensure the physical
plausibility of action paths. These rewards provide immediate feedback and
impose spatial-logical constraints, encouraging the model to go beyond shallow
pattern matching and instead learn deeper, more systematic reasoning about
physical interactions.
comment: 13 pages
☆ Human-like Semantic Navigation for Autonomous Driving using Knowledge Representation and Large Language Models
Achieving full automation in self-driving vehicles remains a challenge,
especially in dynamic urban environments where navigation requires real-time
adaptability. Existing systems struggle to handle navigation plans when faced
with unpredictable changes in road layouts, spontaneous detours, or missing map
data, due to their heavy reliance on predefined cartographic information. In
this work, we explore the use of Large Language Models to generate Answer Set
Programming rules by translating informal navigation instructions into
structured, logic-based reasoning. ASP provides non-monotonic reasoning,
allowing autonomous vehicles to adapt to evolving scenarios without relying on
predefined maps. We present an experimental evaluation in which LLMs generate
ASP constraints that encode real-world urban driving logic into a formal
knowledge representation. By automating the translation of informal navigation
instructions into logical rules, our method improves adaptability and
explainability in autonomous navigation. Results show that LLM-driven ASP rule
generation supports semantic-based decision-making, offering an explainable
framework for dynamic navigation planning that aligns closely with how humans
communicate navigational intent.
comment: 7 pages, 5 figures, submitted for IEEE conference
☆ Unified Multi-Rate Model Predictive Control for a Jet-Powered Humanoid Robot
We propose a novel Model Predictive Control (MPC) framework for a jet-powered
flying humanoid robot. The controller is based on a linearised centroidal
momentum model to represent the flight dynamics, augmented with a second-order
nonlinear model to explicitly account for the slow and nonlinear dynamics of
jet propulsion. A key contribution is the introduction of a multi-rate MPC
formulation that handles the different actuation rates of the robot's joints
and jet engines while embedding the jet dynamics directly into the predictive
model. We validated the framework using the jet-powered humanoid robot iRonCub,
performing simulations in Mujoco; the simulation results demonstrate the
robot's ability to recover from external disturbances and perform stable,
non-abrupt flight manoeuvres, validating the effectiveness of the proposed
approach.
comment: 8 pages, 6 figures
☆ SpineWave: Harnessing Fish Rigid-Flexible Spinal Kinematics for Enhancing Biomimetic Robotic Locomotion
Qu He, Weikun Li, Guangmin Dai, Hao Chen, Qimeng Liu, Xiaoqing Tian, Jie You, Weicheng Cui, Michael S. Triantafyllou, Dixia Fan
Fish have endured millions of years of evolution, and their distinct
rigid-flexible body structures offer inspiration for overcoming challenges in
underwater robotics, such as limited mobility, high energy consumption, and
limited adaptability. This paper introduces SpineWave, a biomimetic robotic fish
featuring a fish-spine-like rigid-flexible transition structure. The structure
integrates expandable fishbone-like ribs and adjustable magnets, mimicking the
stretch and recoil of fish muscles to balance rigidity and flexibility. In
addition, we employed an evolutionary algorithm to optimize the hydrodynamics
of the robot, achieving significant improvements in swimming performance.
Real-world tests demonstrated robustness and potential for environmental
monitoring, underwater exploration, and industrial inspection. These tests
established SpineWave as a transformative platform for aquatic robotics.
☆ Raw2Drive: Reinforcement Learning with Aligned World Models for End-to-End Autonomous Driving (in CARLA v2)
Reinforcement Learning (RL) can mitigate the causal confusion and
distribution shift inherent to imitation learning (IL). However, applying RL to
end-to-end autonomous driving (E2E-AD) remains an open problem because of its training
difficulty, and IL is still the mainstream paradigm in both academia and
industry. Recently, model-based reinforcement learning (MBRL) has demonstrated
promising results in neural planning; however, these methods typically require
privileged information as input rather than raw sensor data. We fill this gap
by designing Raw2Drive, a dual-stream MBRL approach. Initially, we efficiently
train an auxiliary privileged world model paired with a neural planner that
uses privileged information as input. Subsequently, we introduce a raw sensor
world model trained via our proposed Guidance Mechanism, which ensures
consistency between the raw sensor world model and the privileged world model
during rollouts. Finally, the raw sensor world model combines the prior
knowledge embedded in the heads of the privileged world model to effectively
guide the training of the raw sensor policy. Raw2Drive is so far the only
RL-based end-to-end method on CARLA Leaderboard 2.0 and Bench2Drive, and it
achieves state-of-the-art performance.
☆ VL-SAFE: Vision-Language Guided Safety-Aware Reinforcement Learning with World Models for Autonomous Driving
Reinforcement learning (RL)-based autonomous driving policy learning faces
critical limitations such as low sample efficiency and poor generalization; its
reliance on online interactions and trial-and-error learning is especially
unacceptable in safety-critical scenarios. Existing methods, including safe RL,
often fail to capture the true semantic meaning of "safety" in complex driving
contexts, leading to either overly conservative driving behavior or constraint
violations. To address these challenges, we propose VL-SAFE, a world
model-based safe RL framework with a Vision-Language Model
(VLM)-as-safety-guidance paradigm, designed for offline safe policy learning.
Specifically, we construct offline datasets containing data collected by expert
agents and labeled with safety scores derived from VLMs. A world model is
trained to generate imagined rollouts together with safety estimations,
allowing the agent to perform safe planning without interacting with the real
environment. Based on these imagined trajectories and safety evaluations,
actor-critic learning is conducted under VLM-based safety guidance to optimize
the driving policy more safely and efficiently. Extensive evaluations
demonstrate that VL-SAFE achieves superior sample efficiency, generalization,
safety, and overall performance compared to existing baselines. To the best of
our knowledge, this is the first work that introduces a VLM-guided world
model-based approach for safe autonomous driving. The demo video and code can
be accessed at: https://ys-qu.github.io/vlsafe-website/
☆ TacCompress: A Benchmark for Multi-Point Tactile Data Compression in Dexterous Manipulation
Though robotic dexterous manipulation has progressed substantially recently,
challenges like in-hand occlusion still necessitate fine-grained tactile
perception, leading to the integration of more tactile sensors into robotic
hands. Consequently, the increased data volume imposes substantial bandwidth
pressure on signal transmission from the hand's controller. However, the
acquisition and compression of multi-point tactile signals based on the
dexterous hands' physical structures have not been thoroughly explored. In this
paper, our contributions are twofold. First, we introduce a Multi-Point Tactile
Dataset for Dexterous Hand Grasping (Dex-MPTD). This dataset captures tactile
signals from multiple contact sensors across various objects and grasping
poses, offering a comprehensive benchmark for advancing dexterous robotic
manipulation research. Second, we investigate both lossless and lossy
compression on Dex-MPTD by converting tactile data into images and applying six
lossless and five lossy image codecs for efficient compression. Experimental
results demonstrate that tactile data can be losslessly compressed to as low as
0.0364 bits per sub-sample (bpss), achieving a compression ratio of
approximately 200$\times$ relative to the raw tactile data. Efficient lossy
compressors like HM and VTM can achieve about 1000$\times$ data reduction while
preserving acceptable data fidelity. The exploration of lossy compression also reveals
that screen-content-targeted coding tools outperform general-purpose codecs in
compressing tactile data.
comment: 8 pages, 10 figures, 2 tables
☆ DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving
End-to-end autonomous driving (E2E-AD) demands effective processing of
multi-view sensory data and robust handling of diverse and complex driving
scenarios, particularly rare maneuvers such as aggressive turns. The recent
success of the Mixture-of-Experts (MoE) architecture in Large Language Models
(LLMs) demonstrates that specialization of parameters enables strong
scalability. In
this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a
Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is
built upon our $\pi_0$ Vision-Language-Action (VLA) baseline (originally from
the embodied AI field), called Drive-$\pi_0$. Specifically, we add Vision MoE
to Drive-$\pi_0$ by training a router to select relevant cameras according to
the driving context dynamically. This design mirrors human driving cognition,
where drivers selectively attend to crucial visual cues rather than
exhaustively processing all visual information. In addition, we add Action MoE
by training another router to activate specialized expert modules for different
driving behaviors. Through explicit behavioral specialization, DriveMoE is able
to handle diverse scenarios without suffering from mode averaging like
existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE
achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness
of combining vision and action MoE in autonomous driving tasks. We will release
our code and models of DriveMoE and Drive-$\pi_0$.
comment: Project Page: https://thinklab-sjtu.github.io/DriveMoE/
☆ Manipulating Elasto-Plastic Objects With 3D Occupancy and Learning-Based Predictive Control
Manipulating elasto-plastic objects remains a significant challenge due to
severe self-occlusion, difficulties of representation, and complicated
dynamics. This work proposes a novel framework for elasto-plastic object
manipulation with a quasi-static assumption for motions, leveraging 3D
occupancy to represent such objects, a learned dynamics model trained with 3D
occupancy, and a learning-based predictive control algorithm to address these
challenges effectively. We build a novel data collection platform to collect
full spatial information and propose a pipeline for generating a 3D occupancy
dataset. To infer the 3D occupancy during manipulation, an occupancy prediction
network is trained with multiple RGB images supervised by the generated
dataset. We design a deep neural network empowered by a 3D convolutional neural
network (CNN) and a graph neural network (GNN) to predict the complex
deformation with the inferred 3D occupancy results. A learning-based predictive
control algorithm is introduced to plan the robot actions, incorporating a
novel shape-based action initialization module specifically designed to improve
planner efficiency. The proposed framework can successfully
shape the elasto-plastic objects into a given goal shape and has been verified
in various experiments both in simulation and the real world.
comment: 8 Pages, 5 figures, accepted for publication in IEEE Robotics and
Automation Letters (RA-L)
☆ Behavioral Safety Assessment towards Large-scale Deployment of Autonomous Vehicles
Henry X. Liu, Xintao Yan, Haowei Sun, Tinghan Wang, Zhijie Qiao, Haojie Zhu, Shengyin Shen, Shuo Feng, Greg Stevens, Greg McGuire
Autonomous vehicles (AVs) have significantly advanced in real-world
deployment in recent years, yet safety continues to be a critical barrier to
widespread adoption. Traditional functional safety approaches, which primarily
verify the reliability, robustness, and adequacy of AV hardware and software
systems from a vehicle-centric perspective, do not sufficiently address the
AV's broader interactions and behavioral impact on the surrounding traffic
environment. To overcome this limitation, we propose a paradigm shift toward
behavioral safety, a comprehensive approach focused on evaluating AV responses
and interactions within the traffic environment. To systematically assess
behavioral safety, we introduce a third-party AV safety assessment framework
comprising two complementary evaluation components: the Driver Licensing Test
and the Driving Intelligence Test. The Driver Licensing Test evaluates the AV's
reactive behaviors under controlled scenarios, ensuring basic behavioral
competency. In contrast, the Driving Intelligence Test assesses the AV's
interactive behaviors within naturalistic traffic conditions, quantifying the
frequency of safety-critical events to deliver statistically meaningful safety
metrics before large-scale deployment. We validated our proposed framework
using Autoware.Universe, an open-source Level 4 AV, tested both in simulated
environments and on the physical test track at the University of Michigan's
Mcity Testing Facility. The results indicate that Autoware.Universe passed 6
out of 14 scenarios and exhibited a crash rate of 3.01e-3 crashes per mile,
approximately 1,000 times higher than the average human driver crash rate.
During the tests, we also uncovered several unknown unsafe scenarios for
Autoware.Universe. These findings underscore the necessity of behavioral safety
evaluations for improving AV safety performance prior to widespread public
deployment.
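As a rough reading of the two figures reported above (a back-of-the-envelope sketch, not a computation taken from the paper), a crash rate of 3.01e-3 crashes per mile that is approximately 1,000 times the human rate implies a human baseline on the order of 3e-6 crashes per mile. The variable names are illustrative, and the ratio is treated as exactly 1,000 only for the sake of the arithmetic.

```python
# Figures quoted in the abstract.
av_crash_rate = 3.01e-3      # crashes per mile for Autoware.Universe
reported_ratio = 1000.0      # "approximately 1,000 times" (assumed exact here)

# Implied human baseline and the corresponding miles driven per crash.
implied_human_rate = av_crash_rate / reported_ratio
miles_per_crash_av = 1.0 / av_crash_rate
miles_per_crash_human = 1.0 / implied_human_rate

print(f"implied human rate: {implied_human_rate:.2e} crashes/mile")
print(f"AV: about {miles_per_crash_av:.0f} miles per crash")
print(f"human (implied): about {miles_per_crash_human:.0f} miles per crash")
```

Under these assumptions the tested system would crash roughly once every few hundred miles, against roughly once every few hundred thousand miles for the implied human baseline.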
☆ SEM: Enhancing Spatial Understanding for Robust Robot Manipulation
A key challenge in robot manipulation lies in developing policy models with
strong spatial understanding, the ability to reason about 3D geometry, object
relations, and robot embodiment. Existing methods often fall short: 3D point
cloud models lack semantic abstraction, while 2D image encoders struggle with
spatial reasoning. To address this, we propose SEM (Spatial Enhanced
Manipulation model), a novel diffusion-based policy framework that explicitly
enhances spatial understanding from two complementary perspectives. A spatial
enhancer augments visual representations with 3D geometric context, while a
robot state encoder captures embodiment-aware structure through graph-based
modeling of joint dependencies. By integrating these modules, SEM significantly
improves spatial understanding, leading to robust and generalizable
manipulation across diverse tasks that outperforms existing baselines.
☆ EasyInsert: A Data-Efficient and Generalizable Insertion Policy
Insertion is a highly challenging task that requires robots to operate with
exceptional precision in cluttered environments. Existing methods often have
poor generalization capabilities. They typically function in restricted and
structured environments, and frequently fail when the plug and socket are far
apart, when the scene is densely cluttered, or when handling novel objects.
They also rely on strong assumptions such as access to CAD models or a digital
twin in simulation. To address this, we propose EasyInsert, a framework which
leverages the human intuition that relative pose (delta pose) between plug and
socket is sufficient for successful insertion, and employs efficient and
automated real-world data collection with minimal human labor to train a
generalizable model for relative pose prediction. During execution, EasyInsert
follows a coarse-to-fine execution procedure based on predicted delta pose, and
successfully performs various insertion tasks. EasyInsert demonstrates strong
zero-shot generalization capability for unseen objects in cluttered
environments, handling cases with significant initial pose deviations while
maintaining high sample efficiency and requiring little human effort. In
real-world experiments, with just 5 hours of training data, EasyInsert achieves
over 90% success in zero-shot insertion for 13 out of 15 unseen novel objects,
including challenging objects like Type-C cables, HDMI cables, and Ethernet
cables. Furthermore, with only one human demonstration and 4 minutes of
automatically collected data for fine-tuning, it reaches over 90% success rate
for all 15 objects.
☆ Tactile-based Reinforcement Learning for Adaptive Grasping under Observation Uncertainties
Robotic manipulation in industrial scenarios such as construction commonly
faces uncertain observations in which the state of the manipulated object may
not be accurately captured due to occlusions and partial observability. For
example, object status estimation during pipe assembly, rebar installation, and
electrical installation can be impacted by observation errors. Traditional
vision-based grasping methods often struggle to ensure robust stability and
adaptability. To address this challenge, this paper proposes a tactile
simulator that enables a tactile-based adaptive grasping method to enhance
grasping robustness. This approach leverages tactile feedback combined with the
Proximal Policy Optimization (PPO) reinforcement learning algorithm to
dynamically adjust the grasping posture, allowing adaptation to varying
grasping conditions under inaccurate object state estimations. Simulation
results demonstrate that the proposed method effectively adapts grasping
postures, thereby improving the success rate and stability of grasping tasks.
☆ RE-TRIP : Reflectivity Instance Augmented Triangle Descriptor for 3D Place Recognition
While most people associate LiDAR primarily with its ability to measure
distances and provide geometric information about the environment (via point
clouds), LiDAR also captures additional data, including reflectivity or
intensity values. Unfortunately, when LiDAR is applied to Place Recognition
(PR) in mobile robotics, most previous works on LiDAR-based PR rely only on
geometric measurements, neglecting the additional reflectivity information that
LiDAR provides. In this paper, we propose a novel descriptor for 3D PR, named
RE-TRIP (REflectivity-instance augmented TRIangle descriPtor). This new
descriptor leverages both geometric measurements and reflectivity to enhance
robustness in challenging scenarios such as geometric degeneracy, high
geometric similarity, and the presence of dynamic objects. To implement RE-TRIP
in real-world applications, we further propose (1) a keypoint extraction
method, (2) a key instance segmentation method, (3) a RE-TRIP matching method,
and (4) a reflectivity-combined loop verification method. Finally, we conduct a
series of experiments to demonstrate the effectiveness of RE-TRIP. On public
datasets (i.e., HELIPR, FusionPortable) containing diverse scenarios such as
long corridors, bridges, large-scale urban areas, and highly dynamic
environments, our experimental results show that the proposed method
outperforms existing state-of-the-art methods such as Scan Context, Intensity
Scan Context, and STD.
☆ Event-based Reconfiguration Control for Time-varying Formation of Robot Swarms in Narrow Spaces
This study proposes an event-based reconfiguration control to navigate a
robot swarm through challenging environments with narrow passages such as
valleys, tunnels, and corridors. The robot swarm is modeled as an undirected
graph, where each node represents a robot capable of collecting real-time data
on the environment and the states of other robots in the formation. This data
serves as the input for the controller to provide dynamic adjustments between
the desired and straight-line configurations. The controller incorporates a set
of behaviors, designed using artificial potential fields, to meet the
requirements of goal-oriented motion, formation maintenance, tailgating, and
collision avoidance. The stability of the formation control is guaranteed via
the Lyapunov theorem. Simulation and comparison results show that the proposed
controller not only successfully navigates the robot swarm through narrow
spaces but also outperforms other established methods in key metrics including
the success rate, heading order, speed, travel time, and energy efficiency.
Software-in-the-loop tests have also been conducted to validate the
controller's applicability in practical scenarios. The source code of the
controller is available at https://github.com/duynamrcv/erc.
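The artificial potential field idea mentioned above can be sketched in its classic textbook form: an attractive force toward the goal plus a repulsive force from nearby obstacles. This is only an illustration of the general technique, not the paper's controller. The function names, gains (`k_att`, `k_rep`), and influence radius `d0` are all hypothetical, and the paper's formation-maintenance and tailgating behaviors are not modeled.

```python
import math

def attractive_force(pos, goal, k_att=1.0):
    """Force pulling toward the goal (negative gradient of 0.5 * k_att * d^2)."""
    return (k_att * (goal[0] - pos[0]), k_att * (goal[1] - pos[1]))

def repulsive_force(pos, obstacle, k_rep=1.0, d0=2.0):
    """Force pushing away from an obstacle closer than the influence radius d0."""
    dx, dy = pos[0] - obstacle[0], pos[1] - obstacle[1]
    d = math.hypot(dx, dy)
    if d >= d0 or d == 0.0:
        return (0.0, 0.0)
    # Negative gradient of 0.5 * k_rep * (1/d - 1/d0)^2, pointing away from the obstacle.
    mag = k_rep * (1.0 / d - 1.0 / d0) / (d * d)
    return (mag * dx / d, mag * dy / d)

def total_force(pos, goal, obstacles):
    """Sum of the attractive force and all repulsive forces acting at pos."""
    fx, fy = attractive_force(pos, goal)
    for obs in obstacles:
        rx, ry = repulsive_force(pos, obs)
        fx, fy = fx + rx, fy + ry
    return (fx, fy)

# A robot at the origin heading to (10, 0) with one obstacle just above its path.
print(total_force((0.0, 0.0), (10.0, 0.0), [(1.0, 0.5)]))
```

In this toy setup the obstacle sits above and ahead of the robot, so the net force still points toward the goal but gains a downward component that steers the robot away from the obstacle.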
♻ ☆ GraspMolmo: Generalizable Task-Oriented Grasping via Large-Scale Synthetic Data Generation
Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, Rose Hendrix
We present GraspMolmo, a generalizable open-vocabulary task-oriented grasping
(TOG) model. GraspMolmo predicts semantically appropriate, stable grasps
conditioned on a natural language instruction and a single RGB-D frame. For
instance, given "pour me some tea", GraspMolmo selects a grasp on a teapot
handle rather than its body. Unlike prior TOG methods, which are limited by
small datasets, simplistic language, and uncluttered scenes, GraspMolmo learns
from PRISM, a novel large-scale synthetic dataset of 379k samples featuring
cluttered environments and diverse, realistic task descriptions. We fine-tune
the Molmo visual-language model on this data, enabling GraspMolmo to generalize
to novel open-vocabulary instructions and objects. In challenging real-world
evaluations, GraspMolmo achieves state-of-the-art results, with a 70%
prediction success on complex tasks, compared to the 35% achieved by the next
best alternative. GraspMolmo also successfully demonstrates the ability to
predict semantically correct bimanual grasps zero-shot. We release our
synthetic dataset, code, model, and benchmarks to accelerate research in
task-semantic robotic manipulation, which, along with videos, are available at
https://abhaybd.github.io/GraspMolmo/.
♻ ☆ InSpire: Vision-Language-Action Models with Intrinsic Spatial Reasoning
Leveraging pretrained Vision-Language Models (VLMs) to map language
instructions and visual observations to raw low-level actions,
Vision-Language-Action models (VLAs) hold great promise for achieving
general-purpose robotic systems. Despite their advancements, existing VLAs tend
to spuriously correlate task-irrelevant visual features with actions, limiting
their generalization capacity beyond the training data. To tackle this
challenge, we propose Intrinsic Spatial Reasoning (InSpire), a simple yet
effective approach that mitigates the adverse effects of spurious correlations
by boosting the spatial reasoning ability of VLAs. Specifically, InSpire
redirects the VLA's attention to task-relevant factors by prepending the
question "In which direction is the [object] relative to the robot?" to the
language instruction and aligning the answer
"right/left/up/down/front/back/grasped" and predicted actions with the
ground-truth. Notably, InSpire can be used as a plugin to enhance existing
autoregressive VLAs, requiring no extra training data or interaction with other
large models. Extensive experimental results in both simulation and real-world
environments demonstrate the effectiveness and flexibility of our approach. Our
code, pretrained models and demos are publicly available at:
https://Koorye.github.io/proj/Inspire.
♻ ☆ What Matters in Learning A Zero-Shot Sim-to-Real RL Policy for Quadrotor Control? A Comprehensive Study
Jiayu Chen, Chao Yu, Yuqing Xie, Feng Gao, Yinuo Chen, Shu'ang Yu, Wenhao Tang, Shilong Ji, Mo Mu, Yi Wu, Huazhong Yang, Yu Wang
Executing precise and agile flight maneuvers is critical for quadrotors in
various applications. Traditional quadrotor control approaches are limited by
their reliance on flat trajectories or time-consuming optimization, which
restricts their flexibility. Recently, RL-based policies have emerged as a
promising alternative due to their ability to directly map observations to
actions, reducing the need for detailed system knowledge and actuation
constraints. However, a significant challenge remains in bridging the
sim-to-real gap, where RL-based policies often experience instability when
deployed in the real world. In this paper, we investigate key factors for
learning robust RL-based control policies that are capable of zero-shot
deployment on real-world quadrotors. We identify five critical factors and
develop a PPO-based training framework named SimpleFlight, which integrates
these five techniques. We validate the efficacy of SimpleFlight on a Crazyflie
quadrotor,
demonstrating that it achieves more than a 50% reduction in trajectory tracking
error compared to state-of-the-art RL baselines. The policy derived by
SimpleFlight consistently excels across both smooth polynomial trajectories
and challenging infeasible zigzag trajectories on small thrust-to-weight
quadrotors. In contrast, baseline methods struggle with high-speed or
infeasible trajectories. To support further research and reproducibility, we
integrate SimpleFlight into the GPU-based simulator Omnidrones and provide
open-source access to the code and model checkpoints. We hope SimpleFlight will
offer valuable insights for advancing RL-based quadrotor control. For more
details, visit our project website at
https://sites.google.com/view/simpleflight/.
comment: The first two authors contribute equally; Accepted by RA-L
♻ ☆ AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Hongyi Wang, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang
Vision-Language Models (VLMs) show promise for autonomous driving, yet their
struggle with hallucinations, inefficient reasoning, and limited real-world
validation hinders accurate perception and robust step-by-step reasoning. To
overcome this, we introduce AgentThink, a pioneering unified framework
that, for the first time, integrates Chain-of-Thought (CoT) reasoning with
dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink's
core innovations include: (i) Structured Data Generation, by
establishing an autonomous driving tool library to automatically construct
structured, self-verified reasoning data explicitly incorporating tool usage
for diverse driving scenarios; (ii) A Two-stage Training Pipeline,
employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization
(GRPO) to equip VLMs with the capability for autonomous tool invocation; and
(iii) Agent-style Tool-Usage Evaluation, introducing a novel
multi-tool assessment protocol to rigorously evaluate the model's tool
invocation and utilization. Experiments on the DriveLMM-o1 benchmark
demonstrate AgentThink significantly boosts overall reasoning scores by
53.91% and enhances answer accuracy by 33.54%, while
markedly improving reasoning quality and consistency. Furthermore, ablation
studies and robust zero-shot/few-shot generalization experiments across various
benchmarks underscore its powerful capabilities. These findings highlight a
promising trajectory for developing trustworthy and tool-aware autonomous
driving models.
comment: 18 pages, 8 figures
♻ ☆ Neural Internal Model Control: Learning a Robust Control Policy via Predictive Error Feedback
Accurate motion control in the face of disturbances within complex
environments remains a major challenge in robotics. Classical model-based
approaches often struggle with nonlinearities and unstructured disturbances,
while RL-based methods can be fragile when encountering unseen scenarios. In
this paper, we propose a novel framework, Neural Internal Model Control, which
integrates model-based control with RL-based control to enhance robustness. Our
framework streamlines the predictive model by applying Newton-Euler equations
for rigid-body dynamics, eliminating the need to capture complex
high-dimensional nonlinearities. This internal model combines model-free RL
algorithms with predictive error feedback. Such a design enables a closed-loop
control structure to enhance the robustness and generalizability of the control
system. We demonstrate the effectiveness of our framework on both quadrotors
and quadrupedal robots, achieving superior performance compared to
state-of-the-art methods. Furthermore, real-world deployment on a quadrotor
with rope-suspended payloads highlights the framework's robustness in
sim-to-real transfer. Our code is released at
https://github.com/thu-uav/NeuralIMC.
comment: Accepted by RA-L
♻ ☆ GOTPR: General Outdoor Text-based Place Recognition Using Scene Graph Retrieval with OpenStreetMap
We propose GOTPR, a robust place recognition method designed for outdoor
environments where GPS signals are unavailable. Unlike existing approaches that
use point cloud maps, which are large and difficult to store, GOTPR leverages
scene graphs generated from text descriptions and maps for place recognition.
This method improves scalability by replacing point clouds with compact data
structures, allowing robots to efficiently store and utilize extensive map
data. In addition, GOTPR eliminates the need for custom map creation by using
publicly available OpenStreetMap data, which provides global spatial
information. We evaluated its performance using the KITTI360Pose dataset with
corresponding OpenStreetMap data, comparing it to existing point cloud-based
place recognition methods. The results show that GOTPR achieves comparable
accuracy while significantly reducing storage requirements. In city-scale
tests, it completed processing within a few seconds, making it highly practical
for real-world robotics applications. More information can be found at
https://donghwijung.github.io/GOTPR_page/.
♻ ☆ Multi-layer Motion Planning with Kinodynamic and Spatio-Temporal Constraints SC
We propose a novel, multi-layered planning approach for computing paths that
satisfy both kinodynamic and spatiotemporal constraints. Our three-part
framework first establishes potential sequences to meet spatial constraints,
using them to calculate a geometric lead path. This path then guides an
asymptotically optimal sampling-based kinodynamic planner, which minimizes an
STL-robustness cost to jointly satisfy spatiotemporal and kinodynamic
constraints. In our experiments, we test our method with a velocity-controlled
Ackermann-car model and demonstrate significant efficiency gains compared to
prior art. Additionally, our method is able to generate complex path maneuvers,
such as crossovers, something that previous methods had not demonstrated.
comment: Accepted to ACM Hybrid Systems: Computation and Control (HSCC) 2025
♻ ☆ Robo-Platform: A Robotic System for Recording Sensors and Controlling Robots
Mobile smartphones compactly provide sensors such as cameras, IMUs, GNSS
measurement units, and wireless and wired communication channels required for
robotics projects. They are affordable, portable, and programmable, which makes
them ideal for testing, data acquisition, controlling mobile robots, and many
other robotic applications. A robotic system is proposed in this paper,
consisting of an Android phone, a microcontroller board attached to the phone
via USB, and a remote wireless controller station. In the data acquisition
mode, the Android device can record a dataset of a diverse configuration of
multiple cameras, IMUs, GNSS units, and external USB ADC channels in the rawest
format used for, but not limited to, pose estimation and scene reconstruction
applications. In robot control mode, the Android phone, a microcontroller
board, and other peripherals constitute the mobile or stationary robotic
system. This system is controlled using a remote server connected over Wi-Fi or
Bluetooth. Experiments show that although the SLAM and AR applications can
utilize the acquired data, the proposed system can pave the way for more
advanced algorithms for processing these noisy and sporadic measurements.
Moreover, the characteristics of the communication media are studied, and two
example robotic projects, which involve controlling a toy car and a quadcopter,
are included.
comment: Project repository: https://github.com/m-dayani/robo-platform Youtube
Video: https://youtu.be/BTQ4yLB1bak Dataset:
https://drive.google.com/drive/folders/1OZqdA1xa-SyJ64qL_TibqhtwhR1fWWrx?usp=sharing
♻ ☆ Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration AAAI 2025
Understanding how humans cooperatively utilize semantic knowledge to explore
unfamiliar environments and decide on navigation directions is critical for
house service multi-robot systems. Previous methods primarily focused on
single-robot centralized planning strategies, which severely limited
exploration efficiency. Recent research has considered decentralized planning
strategies for multiple robots, assigning separate planning models to each
robot, but these approaches often overlook communication costs. In this work,
we propose Multimodal Chain-of-Thought Co-Navigation (MCoCoNav), a modular
approach that utilizes multimodal Chain-of-Thought to plan collaborative
semantic navigation for multiple robots. MCoCoNav combines visual perception
with Vision Language Models (VLMs) to evaluate exploration value through
probabilistic scoring, thus reducing time costs and achieving stable outputs.
Additionally, a global semantic map is used as a communication bridge,
minimizing communication overhead while integrating observational results.
Guided by scores that reflect exploration trends, robots utilize this map to
assess whether to explore new frontier points or revisit history nodes.
Experiments on HM3D_v0.2 and MP3D demonstrate the effectiveness of our
approach. Our code is available at https://github.com/FrankZxShen/MCoCoNav.git.
comment: 16 pages, 10 figures, Extended Version of accepted AAAI 2025 Paper
♻ ☆ DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
Dexterous grasping remains a fundamental yet challenging problem in robotics.
A general-purpose robot must be capable of grasping diverse objects in
arbitrary scenarios. However, existing research typically relies on restrictive
assumptions, such as single-object settings or limited environments, leading to
constrained generalization. We present DexGraspVLA, a hierarchical framework
for general dexterous grasping in cluttered scenes based on RGB image
perception and language instructions. It utilizes a pre-trained Vision-Language
model as the high-level task planner and learns a diffusion-based policy as the
low-level Action controller. The key insight to achieve robust generalization
lies in iteratively transforming diverse language and visual inputs into
domain-invariant representations via foundation models, where imitation
learning can be effectively applied due to the alleviation of domain shift.
Notably, our method achieves a 90+% success rate under thousands of unseen
object, lighting, and background combinations in a "zero-shot" environment.
Empirical analysis confirms the consistency of internal model behavior across
environmental variations, thereby validating our design and explaining its
generalization performance. DexGraspVLA also demonstrates free-form
long-horizon prompt execution, robustness to adversarial objects and human
disturbance, and failure recovery, which are rarely achieved simultaneously in
prior work. Extended application to nonprehensile object grasping further
proves its generality. Code, model, and video are available at
dexgraspvla.github.io.
comment: 26 pages, 12 figures
♻ ☆ Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
Pre-training vision-language representations on human action videos has
emerged as a promising approach to reduce reliance on large-scale expert
demonstrations for training embodied agents. However, prior methods often
employ time contrastive learning based on goal-reaching heuristics,
progressively aligning language instructions from the initial to the final
frame. This overemphasis on future frames can result in erroneous
vision-language associations, as actions may terminate early or include
irrelevant moments at the end. To address this issue, we propose Action
Temporal Coherence Learning (AcTOL) to learn ordered and continuous
vision-language representations without rigid goal-based constraints. AcTOL
treats a video as a continuous trajectory where it (1) contrasts semantic
differences between frames to reflect their natural ordering, and (2) imposes a
local Brownian bridge constraint to ensure smooth transitions across
intermediate frames. Extensive imitation learning experiments on both simulated
and real robots show that the pretrained features significantly enhance
downstream manipulation tasks with high robustness to different linguistic
styles of instructions, offering a viable pathway toward generalized embodied
agents.
♻ ☆ Policy Contrastive Decoding for Robotic Foundation Models
Robotic foundation models, or generalist robot policies, hold immense
potential to enable flexible, general-purpose and dexterous robotic systems.
Despite their advancements, our empirical experiments reveal that existing
robot policies are prone to learning spurious correlations from pre-training
trajectories, adversely affecting their generalization capabilities beyond the
training data. To tackle this, we propose a novel Policy Contrastive Decoding
(PCD) approach, which redirects the robot policy's focus toward object-relevant
visual cues by contrasting action probability distributions derived from
original and object-masked visual inputs. As a training-free method, our PCD
can be used as a plugin to improve different types of robot policies without
needing to finetune or access model weights. We conduct extensive experiments
on top of three open-source robot policies, including the autoregressive policy
OpenVLA and the diffusion-based policies Octo and $\pi_0$. The obtained results
in both simulation and real-world environments prove PCD's flexibility and
effectiveness, e.g., PCD enhances the state-of-the-art policy $\pi_0$ by 8% in
the simulation environment and by 108% in the real-world environment. Code and
demos are publicly available at: https://Koorye.github.io/proj/PCD.
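To make the idea of contrasting two action distributions concrete, here is a minimal sketch over a discrete action set. The abstract does not give the exact PCD formula, so the combination rule below (amplify the original log-probabilities and subtract the object-masked ones, weighted by a hypothetical `alpha`) is borrowed from the general contrastive-decoding literature and should be read as an assumption. The function name and the toy probabilities are illustrative.

```python
import math

def contrastive_distribution(p_orig, p_masked, alpha=1.0, eps=1e-8):
    """Combine two action distributions as (1 + alpha) * log p_orig - alpha * log p_masked.

    Assumed contrastive rule, not taken from the paper: actions whose
    probability drops when the object is masked are up-weighted.
    """
    scores = [
        (1.0 + alpha) * math.log(p + eps) - alpha * math.log(q + eps)
        for p, q in zip(p_orig, p_masked)
    ]
    # Softmax with max subtraction for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Toy example with three discrete actions.
p_orig = [0.5, 0.3, 0.2]     # policy output on the original image
p_masked = [0.5, 0.1, 0.4]   # policy output with the object masked out
p_cd = contrastive_distribution(p_orig, p_masked)
print(p_cd)
```

In the toy example the first action is equally likely with or without the object, so it is treated as object-independent, while the second action depends on the object and becomes the most probable after contrasting.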
♻ ☆ HCRMP: A LLM-Hinted Contextual Reinforcement Learning Framework for Autonomous Driving
Integrating Large Language Models (LLMs) with Reinforcement Learning (RL) can
enhance autonomous driving (AD) performance in complex scenarios. However,
current LLM-Dominated RL methods over-rely on LLM outputs, which are prone to
hallucinations. Evaluations show that a state-of-the-art LLM achieves a
non-hallucination rate of only approximately 57.95% when assessed on essential
driving-related tasks. Thus, in these methods, hallucinations from the LLM can
directly jeopardize the performance of driving policies. This paper argues that
maintaining relative independence between the LLM and the RL is vital for
solving the hallucinations problem. Consequently, this paper is devoted to
propose a novel LLM-Hinted RL paradigm. The LLM is used to generate semantic
hints for state augmentation and policy optimization to assist RL agent in
motion planning, while the RL agent counteracts potential erroneous semantic
indications through policy learning to achieve excellent driving performance.
Based on this paradigm, we propose the HCRMP (LLM-Hinted Contextual
Reinforcement Learning Motion Planner) architecture, which includes an
Augmented Semantic Representation Module to extend the state space, a
Contextual Stability Anchor Module that enhances the reliability of
multi-critic weight hints by utilizing information from the knowledge base, and
a Semantic Cache Module that seamlessly integrates low-frequency LLM guidance
with high-frequency RL control. Extensive experiments in CARLA validate HCRMP's strong
overall driving performance. HCRMP achieves a task success rate of up to 80.3%
under diverse driving conditions with different traffic densities. Under
safety-critical driving conditions, HCRMP significantly reduces the collision
rate by 11.4%, which effectively improves the driving performance in complex
scenarios.
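One way to picture how low-frequency LLM guidance can feed high-frequency RL control is a cache that reuses the latest hint between LLM queries. The sketch below is only an illustration under that assumption: the class name, the `refresh_every` parameter, and the stand-in `fake_llm` function are hypothetical and are not described in the abstract.

```python
class SemanticCache:
    """Holds the most recent LLM hint so the controller can read it every step."""

    def __init__(self, query_llm, refresh_every):
        if refresh_every < 1:
            raise ValueError("refresh_every must be at least 1")
        self.query_llm = query_llm          # slow call, e.g. an LLM request
        self.refresh_every = refresh_every  # control steps between LLM queries
        self._hint = None
        self._steps_since_refresh = 0

    def get_hint(self, observation):
        """Return the cached hint, refreshing it only every `refresh_every` steps."""
        if self._hint is None or self._steps_since_refresh >= self.refresh_every:
            self._hint = self.query_llm(observation)
            self._steps_since_refresh = 0
        self._steps_since_refresh += 1
        return self._hint


# Stand-in for the LLM: records when it was queried (hypothetical helper).
calls = []


def fake_llm(observation):
    calls.append(observation)
    return {"hint_id": len(calls)}


cache = SemanticCache(fake_llm, refresh_every=10)
hints = [cache.get_hint(step) for step in range(25)]  # 25 high-frequency control steps
```

Here the stand-in LLM is queried only at steps 0, 10, and 20, while the controller receives a hint at every one of the 25 steps.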
♻ ☆ Strengthening Generative Robot Policies through Predictive World Modeling
We present generative predictive control (GPC), a learning control framework
that (i) clones a generative diffusion-based policy from expert demonstrations,
(ii) trains a predictive action-conditioned world model from both expert
demonstrations and random explorations, and (iii) synthesizes an online planner
that ranks and optimizes the action proposals from (i) by looking ahead into
the future using the world model from (ii). Across a variety of robotic
manipulation tasks, we demonstrate that GPC consistently outperforms behavior
cloning in both state-based and vision-based settings, in simulation and in the
real world.
comment: Website: https://computationalrobotics.seas.harvard.edu/GPC
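The ranking part of step (iii) can be sketched as follows, assuming a learned world model and a task reward are available as plain functions. The toy dynamics, the reward, and the random proposals standing in for policy samples are all placeholders, and the sketch omits the proposal optimization that the abstract also mentions.

```python
import numpy as np


def rank_action_proposals(state, proposals, world_model, reward_fn):
    """Score each action sequence by rolling it out in the world model.

    Returns proposal indices sorted from best to worst, plus the raw scores.
    """
    scores = []
    for actions in proposals:
        s = state
        total = 0.0
        for a in actions:
            s = world_model(s, a)   # predicted next state
            total += reward_fn(s)   # accumulate predicted reward
        scores.append(total)
    scores = np.asarray(scores)
    order = np.argsort(scores)[::-1]
    return order, scores


# Toy stand-ins (assumptions): additive point dynamics and distance-to-goal reward.
goal = np.array([1.0, 0.0])


def toy_world_model(s, a):
    return s + a


def toy_reward(s):
    return -float(np.linalg.norm(s - goal))


rng = np.random.default_rng(0)
# 8 proposals, each a horizon-5 sequence of 2-D actions, standing in for policy samples.
proposals = rng.normal(scale=0.3, size=(8, 5, 2))
order, scores = rank_action_proposals(np.zeros(2), proposals, toy_world_model, toy_reward)
best_plan = proposals[order[0]]
```

In a receding-horizon setup, the first action of `best_plan` would be executed and the procedure repeated at the next step.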
♻ ☆ VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving CVPR 2025
Haiming Zhang, Wending Zhou, Yiyao Zhu, Xu Yan, Jiantao Gao, Dongfeng Bai, Yingjie Cai, Bingbing Liu, Shuguang Cui, Zhen Li
This paper introduces VisionPAD, a novel self-supervised pre-training
paradigm designed for vision-centric algorithms in autonomous driving. In
contrast to previous approaches that employ neural rendering with explicit
depth supervision, VisionPAD utilizes more efficient 3D Gaussian Splatting to
reconstruct multi-view representations using only images as supervision.
Specifically, we introduce a self-supervised method for voxel velocity
estimation. By warping voxels to adjacent frames and supervising the rendered
outputs, the model effectively learns motion cues in the sequential data.
Furthermore, we adopt a multi-frame photometric consistency approach to enhance
geometric perception. It projects adjacent frames to the current frame based on
rendered depths and relative poses, boosting the 3D geometric representation
through pure image supervision. Extensive experiments on autonomous driving
datasets demonstrate that VisionPAD significantly improves performance in 3D
object detection, occupancy prediction and map segmentation, surpassing
state-of-the-art pre-training strategies by a considerable margin.
comment: Accepted at CVPR 2025
♻ ☆ Development of a magnetorheological hand exoskeleton featuring a high force-to-power ratio for enhanced grip endurance
Hand exoskeletons have significant potential in labor-intensive fields by
mitigating hand grip fatigue, enhancing hand strength, and preventing injuries.
However, most traditional hand exoskeletons are driven by motors, whose
output force is limited under constrained installation conditions. They also
suffer from high power consumption, complex and bulky assistive systems, and
high instability. In this work, we develop a novel
hand exoskeleton integrated with magnetorheological (MR) clutches that offers a
high force-to-power ratio to improve grip endurance. The clutch features an
enhanced structural design, a micro-roller enhancing structure, which
significantly boosts output force. The experimental data demonstrate that the
clutch can deliver a peak holding force of 380 N while consuming 1.48 W,
yielding a force-to-power ratio of 256.75 N/W, which is 2.35 times higher than
that of the best-reported actuator used for hand exoskeletons. This capability
enables the designed MR hand exoskeleton to provide approximately 419.79 N of
support force for gripping.
The designed MR hand exoskeleton is highly integrated, comprising an
exoskeleton frame, MR clutches, a control unit, and a battery. Evaluations
through static grip endurance tests and dynamic carrying and lifting tests
confirm that the MR hand exoskeleton can effectively reduce muscle fatigue,
extend grip endurance, and minimize injuries. These findings highlight its
strong potential for practical applications in repetitive tasks such as
carrying and lifting in industrial settings.
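The reported force-to-power ratio follows directly from the two measurements quoted in the abstract. The short check below uses only those two figures; the variable names are illustrative.

```python
# Figures quoted in the abstract.
peak_holding_force_n = 380.0   # peak holding force, in newtons
power_consumption_w = 1.48     # power consumption, in watts

# Force-to-power ratio in N/W.
force_to_power = peak_holding_force_n / power_consumption_w
```

The quotient is about 256.757 N/W, which agrees with the reported 256.75 N/W when the value is truncated to two decimals.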