Robotics 47
☆ A Training-Free Framework for Precise Mobile Manipulation of Small Everyday Objects
Many everyday mobile manipulation tasks require precise interaction with
small objects, such as grasping a knob to open a cabinet or pressing a light
switch. In this paper, we develop Servoing with Vision Models (SVM), a
closed-loop training-free framework that enables a mobile manipulator to tackle
such precise tasks involving the manipulation of small objects. SVM employs an
RGB-D wrist camera and uses visual servoing for control. Our novelty lies in
the use of state-of-the-art vision models to reliably compute 3D targets from
the wrist image for diverse tasks and under occlusion due to the end-effector.
To mitigate occlusion artifacts, we employ vision models to out-paint the
end-effector thereby significantly enhancing target localization. We
demonstrate that aided by out-painting methods, open-vocabulary object
detectors can serve as a drop-in module to identify semantic targets (e.g.
knobs) and point tracking methods can reliably track interaction sites
indicated by user clicks. This training-free method obtains an 85% zero-shot
success rate on manipulating unseen objects in novel environments in the real
world, outperforming an open-loop control method and an imitation learning
baseline trained on 1000+ demonstrations by an absolute success rate of 50%.
comment: Project webpage:
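The core geometric step the abstract describes, turning a wrist-image pixel plus depth into a 3D target and servoing toward it, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the proportional `gain`, and the `max_step` clamp are hypothetical, and the pixel is assumed to come from a detector or point tracker.

```python
import math


def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth to a camera-frame point."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def servo_step(target, desired, gain=0.5, max_step=0.05):
    """One proportional servoing step toward the target, clamped to max_step metres."""
    step = [gain * (t - d) for t, d in zip(target, desired)]
    norm = math.sqrt(sum(s * s for s in step))
    if norm > max_step:
        # Scale the step down so the commanded motion never exceeds max_step.
        step = [s * max_step / norm for s in step]
    return tuple(step)


if __name__ == "__main__":
    # Hypothetical intrinsics and a target pixel reported by a detector.
    target = backproject(400, 260, 0.6, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
    print(servo_step(target, desired=(0.0, 0.0, 0.1)))
```

Re-running both functions on every new frame is what makes the loop closed: the target is re-estimated from the latest image instead of being fixed once, as in open-loop control.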
☆ NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants ICRA2025
Navigating unfamiliar environments presents significant challenges for
household robots, requiring the ability to recognize and reason about novel
decoration and layout. Existing reinforcement learning methods cannot be
directly transferred to new environments, as they typically rely on extensive
mapping and exploration, which makes them time-consuming and inefficient. To address
these challenges, we try to transfer the logical knowledge and the
generalization ability of pre-trained foundation models to zero-shot
navigation. By integrating a large vision-language model with a diffusion
network, our approach, named NavigateDiff, constructs a visual predictor that
continuously predicts the agent's potential observations in the next step, which
can help robots generate robust actions. Furthermore, to adapt to the temporal
property of navigation, we introduce temporal historical information to ensure
that the predicted image is aligned with the navigation scene. We then
carefully design an information fusion framework that embeds the predicted
future frames as guidance into a goal-reaching policy to solve downstream image
navigation tasks. This approach enhances navigation control and generalization
across both simulated and real-world environments. Through extensive
experimentation, we demonstrate the robustness and versatility of our method,
showcasing its potential to improve the efficiency and effectiveness of robotic
navigation in diverse settings.
comment: Accepted to ICRA2025
☆ The NavINST Dataset for Multi-Sensor Autonomous Navigation
Paulo Ricardo Marques de Araujo, Eslam Mounier, Qamar Bader, Emma Dawson, Shaza I. Kaoud Abdelaziz, Ahmed Zekry, Mohamed Elhabiby, Aboelmagd Noureldin
The NavINST Laboratory has developed a comprehensive multisensory dataset
from various road-test trajectories in urban environments, featuring diverse
lighting conditions, including indoor garage scenarios with dense 3D maps. This
dataset includes multiple commercial-grade IMUs and a high-end tactical-grade
IMU. Additionally, it contains a wide array of perception-based sensors, such
as a solid-state LiDAR - making it one of the first datasets to do so - a
mechanical LiDAR, four electronically scanning RADARs, a monocular camera, and
two stereo cameras. The dataset also includes forward speed measurements
derived from the vehicle's odometer, along with accurately post-processed
high-end GNSS/IMU data, providing precise ground truth positioning and
navigation information. The NavINST dataset is designed to support advanced
research in high-precision positioning, navigation, mapping, computer vision,
and multisensory fusion. It offers rich, multi-sensor data ideal for developing
and validating robust algorithms for autonomous vehicles. Finally, it is fully
integrated with ROS, ensuring ease of use and accessibility for the
research community. The complete dataset and development tools are available at
comment: 14 pages, 20 figures
☆ Minimally sufficient structures for information-feedback policies
In this paper, we consider robotic tasks which require a desirable outcome to
be achieved in the physical world that the robot is embedded in and interacting
with. Accomplishing this objective requires designing a filter that maintains a
useful representation of the physical world and a policy over the filter
states. A filter is seen as the robot's perspective of the physical world based
on limited sensing, memory, and computation and it is represented as a
transition system over a space of information states. To this end, the
interactions result from the coupling of an internal and an external system, a
filter, and the physical world, respectively, through a sensor mapping and an
information-feedback policy. Within this setup, we look for sufficient
structures, that is, sufficient internal systems and sensors, for accomplishing
a given task. We establish necessary and sufficient conditions that these
structures must satisfy for information-feedback policies defined over the
states of an internal system to exist. We also show that under mild
assumptions, minimal internal systems that can represent a particular
plan/policy described over the action-observation histories exist and are
unique. Finally, the results are applied to determine sufficient structures for
distance-optimal navigation in a polygonal environment.
comment: The 16th International Workshop on the Algorithmic Foundations of
☆ An Online Optimization-Based Trajectory Planning Approach for Cooperative Landing Tasks
This paper presents a real-time trajectory planning scheme for a
heterogeneous multi-robot system (consisting of a quadrotor and a ground mobile
robot) for a cooperative landing task, where the landing position, landing
time, and coordination between the robots are determined autonomously under the
consideration of feasibility and user specifications. The proposed framework
leverages the potential of the complementarity constraint as a decision-maker
and an indicator for diverse cooperative tasks and extends it to the
collaborative landing scenario. In a potential application of the proposed
methodology, a ground mobile robot may serve as a mobile charging station and
coordinate in real time with a quadrotor to be charged, facilitating a safe
and efficient rendezvous and landing. We verified the generated trajectories in
simulation and real-world applications, demonstrating the real-time
capabilities of the proposed landing planning framework.
☆ Exploring Embodied Emotional Communication: A Human-oriented Review of Mediated Social Touch
This paper offers a structured understanding of mediated social touch (MST)
using a human-oriented approach, through an extensive review of literature
spanning tactile interfaces, emotional information, mapping mechanisms, and the
dynamics of human-human and human-robot interactions. By investigating the
existing and exploratory mapping strategies of the 37 selected MST cases, we
established the emotional expression space of MSTs that accommodated a diverse
spectrum of emotions by integrating the categorical and Valence-arousal models,
showcasing how emotional cues can be translated into tactile signals. Based on
the expressive capacity of MSTs, a practical design space was structured
encompassing factors such as the body locations, device form, tactile
modalities, and parameters. We also proposed various design strategies for MSTs
including workflow, evaluation methods, and ethical and cultural
considerations, as well as several future research directions. MSTs' potential
is reflected not only in conveying emotional information but also in fostering
empathy, comfort, and connection in both human-human and human-robot
interactions. This paper aims to serve as a comprehensive reference for design
researchers and practitioners, helping to expand the scope of emotional
communication of MSTs, facilitate the exploration of diverse applications of
affective haptics, and enhance the naturalness and sociability of haptic
comment: This paper is 41 pages long, including references and appendices, and
contains 8 figures. The manuscript has been accepted for publication in CCF
Transactions on Pervasive Computing and Interaction but has not yet been
officially published
☆ 3D Gaussian Splatting aided Localization for Large and Complex Indoor-Environments
The field of visual localization has been researched for several decades and
has meanwhile found many practical applications. Despite the strong progress in
this field, there are still challenging situations in which established methods
fail. We present an approach to significantly improve the accuracy and
reliability of established visual localization methods by adding rendered
images. In detail, we first use a modern visual SLAM approach that provides a
3D Gaussian Splatting (3DGS) based map to create reference data. We demonstrate
that enriching reference data with images rendered from 3DGS at randomly
sampled poses significantly improves the performance of both geometry-based
visual localization and Scene Coordinate Regression (SCR) methods. Through
comprehensive evaluation in a large industrial environment, we analyze the
performance impact of incorporating these additional rendered views.
☆ Multi-Covering a Point Set by $m$ Disks with Minimum Total Area
A common robotics sensing problem is to place sensors to robustly monitor a
set of assets, where robustness is assured by requiring asset $p$ to be
monitored by at least $\kappa(p)$ sensors. Given $n$ assets that must be
observed by $m$ sensors, each with a disk-shaped sensing region, where should
the sensors be placed to minimize the total area observed? We provide and
analyze a fast heuristic for this problem. We then use the heuristic to
initialize an exact Integer Programming solution. Subsequently, we enforce
separation constraints between the sensors by modifying the integer program
formulation and by changing the disk candidate set.
comment: 7 Pages, 7 figures
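To make the problem statement concrete, the sketch below checks whether a candidate placement satisfies the multi-cover requirement and evaluates its cost. It is not the paper's heuristic or integer program. The `(cx, cy, r)` disk representation is an assumption, as is reading "total area" as the sum of the individual disk areas.

```python
import math


def coverage_counts(points, disks):
    """For each point (px, py), count the disks (cx, cy, r) that contain it."""
    return [
        sum(1 for (cx, cy, r) in disks if math.hypot(px - cx, py - cy) <= r)
        for (px, py) in points
    ]


def is_multi_cover(points, kappa, disks):
    """True if point i is covered by at least kappa[i] disks."""
    return all(c >= k for c, k in zip(coverage_counts(points, disks), kappa))


def total_area(disks):
    """Sum of disk areas (assumed objective; overlaps are counted twice)."""
    return sum(math.pi * r * r for (_, _, r) in disks)


if __name__ == "__main__":
    assets = [(0.0, 0.0), (1.0, 0.0)]
    demand = [2, 1]  # kappa(p) for each asset
    placement = [(0.0, 0.0, 1.0), (0.5, 0.0, 0.6)]
    print(is_multi_cover(assets, demand, placement), total_area(placement))
```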
☆ Muscle Activation Estimation by Optimizing the Musculoskeletal Model for Personalized Strength and Conditioning Training
Musculoskeletal models are pivotal in the domains of rehabilitation and
resistance training to analyze muscle conditions. However, individual
variability in musculoskeletal parameters and the immeasurability of some
internal biomechanical variables pose significant obstacles to accurate
personalized modelling. Furthermore, muscle activation estimation can be
challenging due to the inherent redundancy of the musculoskeletal system, where
multiple muscles drive a single joint. This study develops a whole-body
musculoskeletal model for strength and conditioning training and calibrates
relevant muscle parameters with an electromyography-based optimization method.
By utilizing the personalized musculoskeletal model, muscle activation can be
subsequently estimated to analyze the performance of exercises. Bench press and
deadlift are chosen for experimental verification to affirm the efficacy of
this approach.
☆ Active Illumination for Visual Ego-Motion Estimation in the Dark
Visual Odometry (VO) and Visual SLAM (V-SLAM) systems often struggle in
low-light and dark environments due to the lack of robust visual features. In
this paper, we propose a novel active illumination framework to enhance the
performance of VO and V-SLAM algorithms in these challenging conditions. The
developed approach dynamically controls a moving light source to illuminate
highly textured areas, thereby improving feature extraction and tracking.
Specifically, a detector block, which incorporates a deep learning-based
enhancing network, identifies regions with relevant features. Then, a pan-tilt
controller guides the light beam toward these areas so as to provide
information-rich images to the ego-motion estimation algorithm.
Experimental results on a real robotic platform demonstrate the effectiveness
of the proposed method, showing a reduction in pose estimation error of up to
75% with respect to a traditional fixed lighting technique.
☆ Human-Like Robot Impedance Regulation Skill Learning from Human-Human Demonstrations
Humans are experts in collaborating with others physically by regulating
compliance behaviors based on the perception of their partner states and the
task requirements. Enabling robots to develop proficiency in human
collaboration skills can facilitate more efficient human-robot collaboration
(HRC). This paper introduces an innovative impedance regulation skill learning
framework for achieving HRC in multiple physical collaborative tasks. The
framework is designed to adjust the robot compliance to the human partner
states while adhering to reference trajectories provided by human-human
demonstrations. Specifically, electromyography (EMG) signals from human muscles
are collected and analyzed to extract limb impedance, representing compliance
behaviors during demonstrations. Human endpoint motions are captured and
represented using a probabilistic learning method to create reference
trajectories and corresponding impedance profiles. Meanwhile, an LSTM-based
module is implemented to develop task-oriented impedance regulation policies by
mapping the muscle synergistic contributions between two demonstrators.
Finally, we propose a whole-body impedance controller for a human-like robot,
coordinating joint outputs to achieve the desired impedance and reference
trajectory during task execution. Experimental validation was conducted through
a collaborative transportation task and two interactive Tai Chi pushing hands
tasks, demonstrating superior performance from the perspective of interactive
forces compared to a constant impedance control method.
comment: 12 pages, 12 figures
☆ A Framework for Semantics-based Situational Awareness during Mobile Robot Deployments
Tianshu Ruan, Aniketh Ramesh, Hao Wang, Alix Johnstone-Morfoisse, Gokcenur Altindal, Paul Norman, Grigoris Nikolaou, Rustam Stolkin, Manolis Chiou
Deployment of robots into hazardous environments typically involves a
"Human-Robot Teaming" (HRT) paradigm, in which a human supervisor interacts
with a remotely operating robot inside the hazardous zone. Situational
Awareness (SA) is vital for enabling HRT, to support navigation, planning, and
decision-making. This paper explores issues of higher-level "semantic"
information and understanding in SA. In semi-autonomous, or variable-autonomy
paradigms, different types of semantic information may be important, in
different ways, for both the human operator and an autonomous agent controlling
the robot. We propose a generalizable framework for acquiring and combining
multiple modalities of semantic-level SA during remote deployments of mobile
robots. We demonstrate the framework with an example application of search and
rescue (SAR) in disaster response robotics. We propose a set of "environment
semantic indicators" that can reflect a variety of different types of semantic
information, e.g. indicators of risk, or signs of human activity, as the robot
encounters different scenes. Based on these indicators, we propose a metric to
describe the overall situation of the environment called "Situational Semantic
Richness (SSR)". This metric combines multiple semantic indicators to summarise
the overall situation. The SSR indicates if an information-rich and complex
situation has been encountered, which may require advanced reasoning for robots
and humans and hence the attention of the expert human operator. The framework
is tested on a Jackal robot in a mock-up disaster response environment.
Experimental results demonstrate that the proposed semantic indicators are
sensitive to changes in different modalities of semantic information in
different scenes, and the SSR metric reflects overall semantic changes in the
situations encountered.
☆ An Adaptive Data-Enabled Policy Optimization Approach for Autonomous Bicycle Control
This paper presents a unified control framework that integrates a Feedback
Linearization (FL) controller in the inner loop with an adaptive Data-Enabled
Policy Optimization (DeePO) controller in the outer loop to balance an
autonomous bicycle. While the FL controller stabilizes and partially linearizes
the inherently unstable and nonlinear system, its performance is compromised by
unmodeled dynamics and time-varying characteristics. To overcome these
limitations, the DeePO controller is introduced to enhance adaptability and
robustness. The initial control policy of DeePO is obtained from a finite set
of offline, persistently exciting input and state data. To improve stability
and compensate for system nonlinearities and disturbances, a
robustness-promoting regularizer refines the initial policy, while the adaptive
section of the DeePO framework is enhanced with a forgetting factor to improve
adaptation to time-varying dynamics. The proposed DeePO+FL approach is
evaluated through simulations and real-world experiments on an instrumented
autonomous bicycle. Results demonstrate its superiority over the FL-only
approach, achieving more precise tracking of the reference lean angle and lean
☆ SLAMSpoof: Practical LiDAR Spoofing Attacks on Localization Systems Guided by Scan Matching Vulnerability Analysis ICRA
Rokuto Nagata, Kenji Koide, Yuki Hayakawa, Ryo Suzuki, Kazuma Ikeda, Ozora Sako, Qi Alfred Chen, Takami Sato, Kentaro Yoshioka
Accurate localization is essential for enabling modern full self-driving
services. These services heavily rely on map-based traffic information to
reduce uncertainties in recognizing lane shapes, traffic light locations, and
traffic signs. Achieving this level of reliance on map information requires
centimeter-level localization accuracy, which is currently only achievable with
LiDAR sensors. However, LiDAR is known to be vulnerable to spoofing attacks
that emit malicious lasers against LiDAR to overwrite its measurements. Once
localization is compromised, the attack could lead the victim off roads or make
them ignore traffic lights. Motivated by these serious safety implications, we
design SLAMSpoof, the first practical LiDAR spoofing attack on localization
systems for self-driving to assess the actual attack significance on autonomous
vehicles. SLAMSpoof can find effective attack locations based on
our scan matching vulnerability score (SMVS), a point-wise metric representing
the potential vulnerability to spoofing attacks. To evaluate the effectiveness
of the attack, we conduct real-world experiments on ground vehicles and confirm
its high capability in real-world scenarios, inducing position errors of
$\geq$4.2 meters (more than a typical lane width) for all three popular LiDAR-based
localization algorithms. Finally, we discuss potential countermeasures against
this attack. Code is available at
comment: 7 pages, 6 figures, accepted at IEEE International Conference on
Robotics and Automation (ICRA) 2025
☆ MILE: Model-based Intervention Learning ICRA
Imitation learning techniques have been shown to be highly effective in
real-world control scenarios, such as robotics. However, these approaches not
only suffer from compounding error issues but also require human experts to
provide complete trajectories. Although there exist interactive methods where
an expert oversees the robot and intervenes if needed, these extensions usually
only utilize the data collected during intervention periods and ignore the
feedback signal hidden in non-intervention timesteps. In this work, we create a
model to formulate how the interventions occur in such cases, and show that it
is possible to learn a policy with just a handful of expert interventions. Our
key insight is that it is possible to get crucial information about the quality
of the current state and the optimality of the chosen action from expert
feedback, regardless of the presence or the absence of intervention. We
evaluate our method on various discrete and continuous simulation environments,
a real-world robotic manipulation task, as well as a human subject study.
Videos and the code can be found at .
comment: International Conference on Robotics and Automation (ICRA)
☆ VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation ICLR 2025
Vision-language-action models (VLAs) have become increasingly popular in
robot manipulation for their end-to-end design and remarkable performance.
However, existing VLAs rely heavily on vision-language models (VLMs) that only
support text-based instructions, neglecting the more natural speech modality
for human-robot interaction. Traditional speech integration methods usually
involve a separate speech recognition system, which complicates the model and
introduces error propagation. Moreover, the transcription procedure would lose
non-semantic information in the raw speech, such as voiceprint, which may be
crucial for robots to successfully complete customized tasks. To overcome the above
challenges, we propose VLAS, a novel end-to-end VLA that integrates speech
recognition directly into the robot policy model. VLAS allows the robot to
understand spoken commands through inner speech-text alignment and produces
corresponding actions to fulfill the task. We also present two new datasets,
SQA and CSI, to support a three-stage tuning process for speech instructions,
which empowers VLAS with the ability of multimodal interaction across text,
image, speech, and robot actions. Taking a step further, a voice
retrieval-augmented generation (RAG) paradigm is designed to enable our model
to effectively handle tasks that require individual-specific knowledge. Our
extensive experiments show that VLAS can effectively accomplish robot
manipulation tasks with diverse speech commands, offering a seamless and
customized interaction experience.
comment: Accepted as a conference paper at ICLR 2025
☆ Improving Collision-Free Success Rate For Object Goal Visual Navigation Via Two-Stage Training With Collision Prediction
Object goal visual navigation is the task of navigating to a specific
target object using egocentric visual observations. Recent end-to-end
navigation models based on deep reinforcement learning have achieved remarkable
performance in finding and reaching target objects. However, the collision
problem of these models during navigation remains unresolved, since
collisions are typically neglected when evaluating success. Although
incorporating a negative reward for collision during training appears
straightforward, it results in a more conservative policy, thereby limiting the
agent's ability to reach targets. In addition, many of these models utilize
only RGB observations, further increasing the difficulty of collision avoidance
without depth information. To address these limitations, a new concept,
collision-free success, is introduced to evaluate the ability of navigation
models to find a collision-free path towards the target object. A two-stage
training method with collision prediction is proposed to improve the
collision-free success rate of the existing navigation models using RGB
observations. In the first training stage, the collision prediction module
supervises the agent's collision states during exploration to learn to predict
the possible collision. In the second stage, leveraging the trained collision
prediction, the agent learns to navigate to the target without collision. The
experimental results in the AI2-THOR environment demonstrate that the proposed
method greatly improves the collision-free success rate of different navigation
models and outperforms other comparable collision-avoidance methods.
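The difference between plain success and collision-free success can be illustrated with a small evaluation helper. This is a sketch only: the episode fields `reached_target` and `collisions` are hypothetical names, and the paper's exact definition of the metric may differ (for example, in how contact events are counted).

```python
def success_rates(episodes):
    """Return (success rate, collision-free success rate) over a list of episodes.

    Each episode is a dict with a boolean 'reached_target' and an integer
    'collisions' (number of collisions during the episode).
    """
    n = len(episodes)
    if n == 0:
        return 0.0, 0.0
    success = sum(1 for e in episodes if e["reached_target"])
    collision_free = sum(
        1 for e in episodes if e["reached_target"] and e["collisions"] == 0
    )
    return success / n, collision_free / n


if __name__ == "__main__":
    log = [
        {"reached_target": True, "collisions": 0},
        {"reached_target": True, "collisions": 2},
        {"reached_target": False, "collisions": 0},
        {"reached_target": True, "collisions": 0},
    ]
    print(success_rates(log))
```

An episode that reaches the target after bumping into furniture counts toward the first number but not the second, which is why a model can look strong on success alone.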
☆ Ephemerality meets LiDAR-based Lifelong Mapping ICRA 2025
Lifelong mapping is crucial for the long-term deployment of robots in dynamic
environments. In this paper, we present ELite, an ephemerality-aided
LiDAR-based lifelong mapping framework which can seamlessly align multiple
session data, remove dynamic objects, and update maps in an end-to-end fashion.
Map elements are typically classified as static or dynamic, but cases like
parked cars indicate the need for more detailed categories than binary. Central
to our approach is the probabilistic modeling of the world into two-stage
$\textit{ephemerality}$, which represents the transiency of points in the map
within two different time scales. By leveraging the spatiotemporal context
encoded in ephemeralities, ELite can accurately infer transient map elements,
maintain a reliable up-to-date static map, and improve robustness in aligning
the new data in a more fine-grained manner. Extensive real-world experiments on
long-term datasets demonstrate the robustness and effectiveness of our system.
The source code is publicly available for the robotics community:
comment: 6+2 pages, 11 figures, accepted at ICRA 2025
☆ MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation
Lingfeng Zhang, Xiaoshuai Hao, Qinwen Xu, Qiang Zhang, Xinyao Zhang, Pengwei Wang, Jing Zhang, Zhongyuan Wang, Shanghang Zhang, Renjing Xu
Vision-and-language navigation (VLN) is a key task in Embodied AI, requiring
agents to navigate diverse and unseen environments while following natural
language instructions. Traditional approaches rely heavily on historical
observations as spatio-temporal contexts for decision making, leading to
significant storage and computational overhead. In this paper, we introduce
MapNav, a novel end-to-end VLN model that leverages an Annotated Semantic Map
(ASM) to replace historical frames. Specifically, our approach constructs a
top-down semantic map at the start of each episode and updates it at each
timestep, allowing for precise object mapping and structured navigation
information. Then, we enhance this map with explicit textual labels for key
regions, transforming abstract semantics into clear navigation cues and
generating our ASM. The MapNav agent takes the constructed ASM as input and uses
the powerful end-to-end capabilities of the VLM to empower VLN. Extensive experiments
demonstrate that MapNav achieves state-of-the-art (SOTA) performance in both
simulated and real-world environments, validating the effectiveness of our
method. Moreover, we will release our ASM generation source code and dataset to
ensure reproducibility, contributing valuable resources to the field. We
believe that our proposed MapNav can be used as a new memory representation
method in VLN, paving the way for future research in this field.
☆ Physics-Aware Robotic Palletization with Online Masking Inference ICRA 2025
Tianqi Zhang, Zheng Wu, Yuxin Chen, Yixiao Wang, Boyuan Liang, Scott Moura, Masayoshi Tomizuka, Mingyu Ding, Wei Zhan
The efficient planning of stacking boxes, especially in the online setting
where the sequence of item arrivals is unpredictable, remains a critical
challenge in modern warehouse and logistics management. Existing solutions
often address box size variations, but overlook their intrinsic and physical
properties, such as density and rigidity, which are crucial for real-world
applications. We use reinforcement learning (RL) to solve this problem by
employing action space masking to direct the RL policy toward valid actions.
Unlike previous methods that rely on heuristic stability assessments which are
difficult to assess in physical scenarios, our framework utilizes online
learning to dynamically train the action space mask, eliminating the need for
manual heuristic design. Extensive experiments demonstrate that our proposed
method outperforms existing state-of-the-art methods. Furthermore, we deploy our
learned task planner in a real-world robotic palletizer, validating its
practical applicability in operational settings.
comment: Accepted by ICRA 2025
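Action space masking, as mentioned in the abstract, is commonly implemented by removing invalid actions before the policy's softmax. The sketch below shows that generic mechanism only. In the paper the mask is learned online, whereas here `valid_mask` is simply given as an assumed input.

```python
import math


def masked_softmax(logits, valid_mask):
    """Softmax over logits in which masked-out (invalid) actions get probability 0."""
    valid = [l for l, ok in zip(logits, valid_mask) if ok]
    if not valid:
        raise ValueError("at least one action must be valid")
    m = max(valid)  # subtract the max valid logit for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, valid_mask)]
    z = sum(exps)
    return [e / z for e in exps]


if __name__ == "__main__":
    # Hypothetical placement scores; the second placement is judged unstable.
    print(masked_softmax([1.0, 5.0, 1.0], [True, False, True]))
```

Because the masked action receives exactly zero probability, the policy can never sample it, no matter how high its raw logit is.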
☆ Generative Predictive Control: Flow Matching Policies for Dynamic and Difficult-to-Demonstrate Tasks
Generative control policies have recently unlocked major progress in
robotics. These methods produce action sequences via diffusion or flow
matching, with training data provided by demonstrations. But despite enjoying
considerable success on difficult manipulation problems, generative policies
come with two key limitations. First, behavior cloning requires expert
demonstrations, which can be time-consuming and expensive to obtain. Second,
existing methods are limited to relatively slow, quasi-static tasks. In this
paper, we leverage a tight connection between sampling-based predictive control
and generative modeling to address each of these issues. In particular, we
introduce generative predictive control, a supervised learning framework for
tasks with fast dynamics that are easy to simulate but difficult to
demonstrate. We then show how trained flow-matching policies can be
warm-started at run-time, maintaining temporal consistency and enabling fast
feedback rates. We believe that generative predictive control offers a
complementary approach to existing behavior cloning methods, and hope that it
paves the way toward generalist policies that extend beyond quasi-static
demonstration-oriented tasks.
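For readers unfamiliar with flow matching, sampling amounts to integrating a learned velocity field from noise at t = 0 to an action sequence at t = 1. The sketch below uses forward Euler integration and a closed-form toy field in place of a trained network. The function names and the toy field are illustrative assumptions, and the paper's warm-starting scheme is not reproduced here.

```python
def sample_flow(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Stand-in for a trained network: velocity of the straight-line path to `target`."""
    def velocity(x, t):
        # Valid for t < 1, which forward Euler over [0, 1) guarantees.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


if __name__ == "__main__":
    noise = [0.3, -1.2, 0.8]    # initial sample, e.g. drawn from a Gaussian
    action = [1.0, 0.0, -0.5]   # hypothetical target action sequence
    print(sample_flow(make_toy_velocity(action), noise, num_steps=4))
```

With a trained policy, `velocity` would be a neural network conditioned on the robot's observation, and fewer integration steps would trade accuracy for a higher feedback rate.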
☆ Object-Pose Estimation With Neural Population Codes
Robotic assembly tasks require object-pose estimation, particularly for tasks
that avoid costly mechanical constraints. Object symmetry complicates the
direct mapping of sensory input to object rotation, as the rotation becomes
ambiguous and lacks a unique training target. Some proposed solutions involve
evaluating multiple pose hypotheses against the input or predicting a
probability distribution, but these approaches suffer from significant
computational overhead. Here, we show that representing object rotation with a
neural population code overcomes these limitations, enabling a direct mapping
to rotation and end-to-end learning. As a result, population codes facilitate
fast and accurate pose estimation. On the T-LESS dataset, we achieve inference
in 3.2 milliseconds on an Apple M1 CPU and a Maximum Symmetry-Aware Surface
Distance accuracy of 84.7% using only gray-scale image input, compared to 69.7%
accuracy when directly mapping to pose.
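The idea of a population code can be illustrated on a single rotation angle. The sketch below uses the textbook construction: evenly spaced preferred angles, von Mises-shaped tuning curves, and population-vector decoding. It is not the paper's encoding of full 3D rotations, and `num_neurons` and `kappa` are arbitrary illustrative values.

```python
import math


def encode_angle(theta, num_neurons=16, kappa=4.0):
    """Encode an angle as activations of neurons with evenly spaced preferred angles."""
    prefs = [2.0 * math.pi * i / num_neurons for i in range(num_neurons)]
    # Von Mises-shaped tuning curve, peaking at 1 when theta equals the preference.
    return [math.exp(kappa * (math.cos(theta - p) - 1.0)) for p in prefs]


def decode_angle(activations):
    """Population-vector decoding: angle of the activation-weighted sum of unit vectors."""
    n = len(activations)
    prefs = [2.0 * math.pi * i / n for i in range(n)]
    s = sum(a * math.sin(p) for a, p in zip(activations, prefs))
    c = sum(a * math.cos(p) for a, p in zip(activations, prefs))
    return math.atan2(s, c)  # result lies in (-pi, pi]


if __name__ == "__main__":
    code = encode_angle(1.0)
    print(decode_angle(code))
```

A network trained to output such activations can represent several plausible rotations at once as multiple bumps in the code. That is one reason this kind of representation is attractive for symmetric objects, where regressing a single angle gives an ambiguous target.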
☆ Low-Complexity Cooperative Payload Transportation for Nonholonomic Mobile Robots Under Scalable Constraints
Cooperative transportation, a key aspect of logistics cyber-physical systems
(CPS), is typically approached using distributed control and
optimization-based methods. Distributed control methods consume less time but
handle multiple constraints poorly and are hard to extend to them. In
contrast, optimization-based methods handle constraints effectively, but they
are usually centralized and time-consuming, and thus not easily scalable to
numerous robots. To overcome the drawbacks of both, we propose a novel
cooperative transportation method for nonholonomic mobile robots by improving
conventional formation control; the method is distributed, has low time
complexity, and accommodates scalable constraints. The proposed control-based
method is tested on a cable-suspended payload and is divided into two parts:
robot trajectory generation and trajectory tracking. Unlike most
time-consuming trajectory generation methods, ours generates trajectories with
only constant time complexity and without the need for global maps. As for
trajectory tracking, our control-based method not only scales easily to
multiple constraints, as optimization-based methods do, but also reduces the
time complexity from polynomial to linear. Simulations and experiments verify
the feasibility of our method.
♻ ☆ Robotic Table Tennis: A Case Study into a High Speed Learning System
David B. D'Ambrosio, Jonathan Abelian, Saminda Abeyruwan, Michael Ahn, Alex Bewley, Justin Boyd, Krzysztof Choromanski, Omar Cortes, Erwin Coumans, Tianli Ding, Wenbo Gao, Laura Graesser, Atil Iscen, Navdeep Jaitly, Deepali Jain, Juhana Kangaspunta, Satoshi Kataoka, Gus Kouretas, Yuheng Kuang, Nevena Lazic, Corey Lynch, Reza Mahjourian, Sherry Q. Moore, Thinh Nguyen, Ken Oslund, Barney J Reed, Krista Reymann, Pannag R. Sanketi, Anish Shankar, Pierre Sermanet, Vikas Sindhwani, Avi Singh, Vincent Vanhoucke, Grace Vesom, Peng Xu
We present a deep-dive into a real-world robotic learning system that, in
previous work, was shown to be capable of hundreds of table tennis rallies with
a human and has the ability to precisely return the ball to desired targets.
This system puts together a highly optimized perception subsystem, a high-speed
low-latency robot controller, a simulation paradigm that can prevent damage in
the real world and also train policies for zero-shot transfer, and automated
real world environment resets that enable autonomous training and evaluation on
physical robots. We complement a complete system description, including
numerous design decisions that are typically not widely disseminated, with a
collection of studies that clarify the importance of mitigating various sources
of latency, accounting for training and deployment distribution shifts,
robustness of the perception system, sensitivity to policy hyper-parameters,
and choice of action space. A video demonstrating the components of the system
and details of experimental results can be found at
comment: Published and presented at Robotics: Science and Systems (RSS2023)
♻ ☆ Personalized Instance-based Navigation Toward User-Specific Objects in Realistic Environments NeurIPS 2024
In recent years, research interest in visual navigation towards objects
in indoor environments has grown significantly. This growth can be attributed
to the recent availability of large navigation datasets in photo-realistic
simulated environments, like Gibson and Matterport3D. However, the navigation
tasks supported by these datasets are often restricted to the objects present
in the environment at acquisition time. Also, they fail to account for the
realistic scenario in which the target object is a user-specific instance that
can be easily confused with similar objects and may be found in multiple
locations within the environment. To address these limitations, we propose a
new task called Personalized Instance-based Navigation (PIN), in which an
embodied agent is tasked with locating and reaching a specific personal object
by distinguishing it among multiple instances of the same category. The task is
accompanied by PInNED, a dedicated new dataset composed of photo-realistic
scenes augmented with additional 3D objects. In each episode, the target object
is presented to the agent using two modalities: a set of visual reference
images on a neutral background and manually annotated textual descriptions.
Through comprehensive evaluations and analyses, we showcase the challenges of
the PIN task as well as the performance and shortcomings of currently available
methods designed for object-driven navigation, considering modular and
end-to-end agents.
comment: NeurIPS 2024 Datasets and Benchmarks Track. Project page:
♻ ☆ ArrayBot: Reinforcement Learning for Generalizable Distributed Manipulation through Touch ICRA24
We present ArrayBot, a distributed manipulation system consisting of a $16
\times 16$ array of vertically sliding pillars integrated with tactile sensors,
which can simultaneously support, perceive, and manipulate the tabletop
objects. Towards generalizable distributed manipulation, we leverage
reinforcement learning (RL) algorithms for the automatic discovery of control
policies. In the face of the massively redundant actions, we propose to reshape
the action space by considering the spatially local action patch and the
low-frequency actions in the frequency domain. With this reshaped action space,
we train RL agents that can relocate diverse objects through tactile
observations only. Surprisingly, we find that the discovered policy can not
only generalize to unseen object shapes in the simulator but also transfer to
the physical robot without any domain randomization. Leveraging the deployed
policy, we present abundant real-world manipulation tasks, illustrating the
vast potential of RL on ArrayBot for distributed manipulation.
comment: ICRA24
♻ ☆ ACROSS: A Deformation-Based Cross-Modal Representation for Robotic Tactile Perception ICRA 2025
Tactile perception is essential for human interaction with the environment
and is becoming increasingly crucial in robotics. Tactile sensors like the
BioTac mimic human fingertips and provide detailed interaction data. Despite
its utility in applications like slip detection and object identification, this
sensor is now deprecated, making many valuable datasets obsolete. However,
recreating similar datasets with newer sensor technologies is both tedious and
time-consuming. Therefore, adapting these existing datasets for use with new
setups and modalities is crucial. In response, we introduce ACROSS, a novel
framework for translating data between tactile sensors by exploiting sensor
deformation information. We demonstrate the approach by translating BioTac
signals into the DIGIT sensor. Our framework consists of first converting the
input signals into 3D deformation meshes. We then transition from the 3D
deformation mesh of one sensor to the mesh of another, and finally convert the
generated 3D deformation mesh into the corresponding output space. We
demonstrate our approach on the most challenging problem of going from a
low-dimensional tactile representation to a high-dimensional one. In
particular, we transfer the tactile signals of a BioTac sensor to DIGIT tactile
images. Our approach enables the continued use of valuable datasets and data
exchange between groups with different setups.
comment: Accepted to 2025 IEEE Conference on Robotics and Automation (ICRA
2025). arXiv admin note: text overlap with arXiv:2410.14310
♻ ☆ Traffic Scene Generation from Natural Language Description for Autonomous Vehicles with Large Language Model
Text-to-scene generation typically limits environmental diversity by
generating key scenarios along predetermined paths. To address these
constraints, we propose a novel text-to-traffic scene framework that leverages
a large language model (LLM) to autonomously generate diverse traffic scenarios
for the CARLA simulator based on natural language descriptions. Our pipeline
comprises several key stages: (1) Prompt Analysis, where natural language
inputs are decomposed; (2) Road Retrieval, selecting optimal roads from a
database; (3) Agent Planning, detailing agent types and behaviors; (4) Road
Ranking, scoring roads to match scenario requirements; and (5) Scene
Generation, rendering the planned scenarios in the simulator. This framework
supports both routine and critical traffic scenarios, enhancing its
applicability. We demonstrate that our approach not only diversifies agent
planning and road selection but also significantly reduces the average
collision rate from 8% to 3.5% in SafeBench. Additionally, our framework
improves narration and reasoning for driving captioning tasks. Our
contributions and resources are publicly available at
comment: update to the newest version
♻ ☆ Soft Synergies: Model Order Reduction of Hybrid Soft-Rigid Robots via Optimal Strain Parameterization
Abdulaziz Y. Alkayas, Anup Teejo Mathew, Daniel Feliu-Talegon, Ping Deng, Thomas George Thuruthel, Federico Renda
Soft robots offer remarkable adaptability and safety advantages over rigid
robots, but modeling their complex, nonlinear dynamics remains challenging.
Strain-based models have recently emerged as a promising candidate to describe
such systems; however, they tend to be high-dimensional and time-consuming.
This paper presents a novel model order reduction approach for soft and hybrid
robots by combining strain-based modeling with Proper Orthogonal Decomposition
(POD). The method identifies optimal coupled strain basis functions -- or
mechanical synergies -- from simulation data, enabling the description of soft
robot configurations with a minimal number of generalized coordinates. The
reduced order model (ROM) achieves substantial dimensionality reduction in the
configuration space while preserving accuracy. Rigorous testing demonstrates
the interpolation and extrapolation capabilities of the ROM for soft
manipulators under static and dynamic conditions. The approach is further
validated on a snake-like hyper-redundant rigid manipulator and a closed-chain
system with soft and rigid components, illustrating its broad applicability.
Moreover, the approach is leveraged for shape estimation of a real six-actuator
soft manipulator using only two position markers, showcasing its practical
utility. Finally, the ROM's dynamic and static behavior is validated
experimentally against a parallel hybrid soft-rigid system, highlighting its
effectiveness in representing the High-Order Model (HOM) and the real system.
This POD-based ROM offers significant computational speed-ups, paving the way
for real-time simulation and control of complex soft and hybrid robots.
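As a reading aid, the standard Proper Orthogonal Decomposition step that the abstract refers to can be written as follows. This is the textbook formulation, not necessarily the authors' exact procedure. All symbols are assumptions introduced here: $X$ is a snapshot matrix of $n$ strain coordinates sampled at $m$ simulation instants, $\Phi_r$ is the reduced basis, $a(t)$ are the reduced generalized coordinates, and $\varepsilon$ is a user-chosen truncation tolerance.

```latex
X = \begin{bmatrix} q(t_1) & q(t_2) & \cdots & q(t_m) \end{bmatrix} \in \mathbb{R}^{n \times m},
\qquad X = U \Sigma V^{\top},
\qquad q(t) \approx \Phi_r \, a(t), \quad \Phi_r = \begin{bmatrix} u_1 & \cdots & u_r \end{bmatrix},
\qquad \frac{\sum_{i=1}^{r} \sigma_i^2}{\sum_{i=1}^{\min(n,m)} \sigma_i^2} \ge 1 - \varepsilon .
```

Here $u_i$ are the leading left singular vectors of $X$ and $\sigma_i$ the corresponding singular values. The last condition picks the smallest $r$ that retains the desired fraction of snapshot energy.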
♻ ☆ Invisible Servoing: a Visual Servoing Approach with Return-Conditioned Latent Diffusion
In this paper, we present a novel visual servoing (VS) approach based on
latent Denoising Diffusion Probabilistic Models (DDPMs), that explores the
application of generative models for vision-based navigation of UAVs (Uncrewed
Aerial Vehicles). Unlike classical VS methods, the proposed approach
allows reaching the desired target view, even when the target is initially not
visible. This is possible thanks to the learning of a latent representation
that the DDPM uses for planning and a dataset of trajectories encompassing
target-invisible initial views. A compact representation is learned from raw
images using a Cross-Modal Variational Autoencoder. Given the current image,
the DDPM generates trajectories in the latent space driving the robotic
platform to the desired visual target. The approach has been validated in
simulation using two generic multi-rotor UAVs (a quadrotor and a hexarotor).
The results show that we can successfully reach the visual target, even if not
visible in the initial view.
♻ ☆ Bridging Adaptivity and Safety: Learning Agile Collision-Free Locomotion Across Varied Physics
Real-world legged locomotion systems often need to reconcile agility and
safety for different scenarios. Moreover, the underlying dynamics are often
unknown and time-variant (e.g., payload, friction). In this paper, we introduce
BAS (Bridging Adaptivity and Safety), which builds upon the pipeline of prior
work Agile But Safe (ABS) (He et al.) and is designed to provide adaptive safety
even in dynamic environments with uncertainties. BAS involves an agile policy
to avoid obstacles rapidly and a recovery policy to prevent collisions, a
physical parameter estimator that is concurrently trained with the agile policy,
and a learned control-theoretic RA (reach-avoid) value network that governs the
policy switch. Also, the agile policy and RA network are both conditioned on
physical parameters to make them adaptive. To mitigate the distribution shift
issue, we further introduce an on-policy fine-tuning phase for the estimator to
enhance its robustness and accuracy. The simulation results show that BAS
achieves 50% better safety than baselines in dynamic environments while
maintaining a higher speed on average. In real-world experiments, BAS shows its
capability in complex environments with unknown physics (e.g., slippery floors
with unknown frictions, unknown payloads up to 8kg), while baselines lack
adaptivity, leading to collisions or degraded agility. As a result, BAS
achieves a 19.8% increase in speed and gets a 2.36 times lower collision rate
than ABS in the real world. Videos:
comment: 11 Pages, 6 Figures
♻ ☆ Why Sample Space Matters: Keyframe Sampling Optimization for LiDAR-based Place Recognition
Recent advances in robotics are driving real-world autonomy for long-term and
large-scale missions, where loop closures via place recognition are vital for
mitigating pose estimation drift. However, achieving real-time performance
remains challenging for resource-constrained mobile robots and multi-robot
systems due to the computational burden of high-density sampling, which
increases the complexity of comparing and verifying query samples against a
growing map database. Conventional methods often retain redundant information
or miss critical data by relying on fixed sampling intervals or operating in
3-D space instead of the descriptor feature space. To address these challenges,
we introduce the concept of sample space and propose a novel keyframe sampling
approach for LiDAR-based place recognition. Our method minimizes redundancy
while preserving essential information in the hyper-dimensional descriptor
space, supporting both learning-based and handcrafted descriptors. The proposed
approach incorporates a sliding window optimization strategy to ensure
efficient keyframe selection and real-time performance, enabling seamless
integration into robotic pipelines. In sum, our approach demonstrates robust
performance across diverse datasets, with the ability to adapt seamlessly from
indoor to outdoor scenarios without parameter tuning, reducing loop closure
detection times and memory requirements.
comment: 20 pages, 17 figures, 6 tables. Revised
♻ ☆ pySLAM: An Open-Source, Modular, and Extensible Framework for SLAM
pySLAM is an open-source Python framework for Visual SLAM, supporting
monocular, stereo, and RGB-D cameras. It provides a flexible interface for
integrating both classical and modern local features, making it adaptable to
various SLAM tasks. The framework includes different loop closure methods, a
volumetric reconstruction pipeline, and support for depth prediction models.
Additionally, it offers a suite of tools for visual odometry and SLAM
applications. Designed for both beginners and experienced researchers, pySLAM
encourages community contributions, fostering collaborative development in the
field of Visual SLAM.
♻ ☆ FuzzRisk: Online Collision Risk Estimation for Autonomous Vehicles based on Depth-Aware Object Detection via Fuzzy Inference ICRA 2025
This paper presents a novel monitoring framework that infers the level of
collision risk for autonomous vehicles (AVs) based on their object detection
performance. The framework takes two sets of predictions from different
algorithms and associates their inconsistencies with the collision risk via
fuzzy inference. The first set of predictions is obtained by retrieving
safety-critical 2.5D objects from a depth map, and the second set comes from
the ordinary AV's 3D object detector. We experimentally validate that, based on
Intersection-over-Union (IoU) and a depth discrepancy measure, the
inconsistencies between the two sets of predictions strongly correlate with the
error of the 3D object detector against ground truths. This correlation allows
us to construct a fuzzy inference system and map the inconsistency measures to
an AV collision risk indicator. In particular, we optimize the fuzzy inference
system towards an existing offline metric that matches AV collision rates well.
Lastly, we validate our monitor's capability to produce relevant risk estimates
with the large-scale nuScenes dataset and demonstrate that it can safeguard an
AV in closed-loop simulations.
comment: Accepted by ICRA 2025, 7 pages (IEEE double column format), 5
figures, 3 tables
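To make the two inconsistency measures named in the abstract concrete, here is a minimal Python sketch, assuming axis-aligned 2D boxes given as `(x1, y1, x2, y2)` and one scalar depth per detection. The IoU formula is standard. The relative depth gap in `inconsistency` is a hypothetical stand-in, since the abstract does not define its depth discrepancy measure, and the fuzzy inference step is not shown.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Overlap is zero when the boxes do not intersect.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def inconsistency(det_a, det_b):
    """Return (1 - IoU, relative depth gap) for two detections of one object.

    Each detection is a (box, depth) pair; the relative depth gap is an
    illustrative assumption, not the measure used in the paper.
    """
    box_a, depth_a = det_a
    box_b, depth_b = det_b
    depth_gap = abs(depth_a - depth_b) / max(depth_a, depth_b)
    return 1.0 - iou(box_a, box_b), depth_gap
```

Both returned values are 0 when the two predictions agree exactly and grow as they diverge, which is the kind of input a fuzzy inference system could map to a risk level.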
♻ ☆ MonoForce: Learnable Image-conditioned Physics Engine
We propose a novel model for the prediction of robot trajectories on rough
offroad terrain from the onboard camera images. This model enforces the laws of
classical mechanics through a physics-aware neural symbolic layer while
preserving the ability to learn from large-scale data as it is end-to-end
differentiable. The proposed hybrid model integrates a black-box component that
predicts robot-terrain interaction forces with a neural-symbolic layer. This
layer includes a differentiable physics engine that computes the robot's
trajectory by querying these forces at the points of contact with the terrain.
As the proposed architecture comprises substantial geometrical and physics
priors, the resulting model can also be seen as a learnable physics engine
conditioned on real images that delivers $10^4$ trajectories per second. We
argue and empirically demonstrate that this architecture reduces the
sim-to-real gap and mitigates out-of-distribution sensitivity. The
differentiability, in conjunction with the rapid simulation speed, makes the
model well-suited for various applications including model predictive control,
trajectory shooting, supervised and reinforcement learning or SLAM. The code
and data are publicly available.
comment: Code:
♻ ☆ EnvoDat: A Large-Scale Multisensory Dataset for Robotic Spatial Awareness and Semantic Reasoning in Heterogeneous Environments
Linus Nwankwo, Bjoern Ellensohn, Vedant Dave, Peter Hofer, Jan Forstner, Marlene Villneuve, Robert Galler, Elmar Rueckert
To ensure the efficiency of robot autonomy under diverse real-world
conditions, a high-quality heterogeneous dataset is essential to benchmark the
operating algorithms' performance and robustness. Current benchmarks
predominantly focus on urban terrains, specifically for on-road autonomous
driving, leaving multi-degraded, densely vegetated, dynamic and feature-sparse
environments, such as underground tunnels, natural fields, and modern indoor
spaces underrepresented. To fill this gap, we introduce EnvoDat, a large-scale,
multi-modal dataset collected in diverse environments and conditions, including
high illumination, fog, rain, and zero visibility at different times of the
day. Overall, EnvoDat contains 26 sequences from 13 scenes, 10 sensing
modalities, over 1.9TB of data, and over 89K fine-grained polygon-based
annotations for more than 82 object and terrain classes. We post-processed
EnvoDat in different formats that support benchmarking SLAM and supervised
learning algorithms, and fine-tuning multimodal vision models. With EnvoDat, we
contribute to environment-resilient robotic autonomy in areas where the
conditions are extremely challenging. The datasets and other relevant resources
can be accessed through
♻ ☆ Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment
Deep neural network models have achieved remarkable progress in 3D scene
understanding when trained in the closed-set setting and with full labels.
However, the major bottleneck is that these models do not have the capacity to
recognize any unseen novel classes beyond the training categories in diverse
real-world applications. Therefore, we are in urgent need of a framework that
can simultaneously be applicable to both 3D point cloud segmentation and
detection, particularly in the circumstances where the labels are rather
scarce. This work presents a generalized and straightforward framework for
dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language
models, we propose a hierarchical feature-aligned pre-training and knowledge
distillation strategy to extract and distill meaningful information from
large-scale vision-language models, which helps benefit the open-vocabulary
scene understanding tasks. To encourage latent instance discrimination and to
guarantee efficiency, we propose the unsupervised region-level semantic
contrastive learning scheme for point clouds, using confident predictions of
the neural network to discriminate the intermediate feature embeddings at
multiple stages. In the limited reconstruction case, our proposed approach,
termed WS3D++, ranks 1st on the large-scale ScanNet benchmark on both the task
of semantic segmentation and instance segmentation. Extensive experiments with
both indoor and outdoor scenes demonstrated the effectiveness of our approach
in both data-efficient learning and open-world few-shot learning. The code is
made publicly available at:
comment: IEEE Transactions on Pattern Analysis and Machine Intelligence,
Manuscript Info: 17 Pages, 13 Figures, and 6 Tables
♻ ☆ Towards Fusing Point Cloud and Visual Representations for Imitation Learning
Atalay Donat, Xiaogang Jia, Xi Huang, Aleksandar Taranovic, Denis Blessing, Ge Li, Hongyi Zhou, Hanyi Zhang, Rudolf Lioutikov, Gerhard Neumann
Learning for manipulation requires using policies that have access to rich
sensory information such as point clouds or RGB images. Point clouds
efficiently capture geometric structures, making them essential for
manipulation tasks in imitation learning. In contrast, RGB images provide rich
texture and semantic information that can be crucial for certain tasks.
Existing approaches for fusing both modalities assign 2D image features to
point clouds. However, such approaches often lose global contextual information
from the original images. In this work, we propose FPV-Net, a novel imitation
learning method that effectively combines the strengths of both point cloud and
RGB modalities. Our method conditions the point-cloud encoder on global and
local image tokens using adaptive layer norm conditioning, leveraging the
beneficial properties of both modalities. Through extensive experiments on the
challenging RoboCasa benchmark, we demonstrate the limitations of relying on
either modality alone and show that our method achieves state-of-the-art
performance across all tasks.
♻ ☆ X-IL: Exploring the Design Space of Imitation Learning Policies
Xiaogang Jia, Atalay Donat, Xi Huang, Xuan Zhao, Denis Blessing, Hongyi Zhou, Han A. Wang, Hanyi Zhang, Qian Wang, Rudolf Lioutikov, Gerhard Neumann
Designing modern imitation learning (IL) policies requires making numerous
decisions, including the selection of feature encoding, architecture, policy
representation, and more. As the field rapidly advances, the range of available
options continues to grow, creating a vast and largely unexplored design space
for IL policies. In this work, we present X-IL, an accessible open-source
framework designed to systematically explore this design space. The framework's
modular design enables seamless swapping of policy components, such as
backbones (e.g., Transformer, Mamba, xLSTM) and policy optimization techniques
(e.g., Score-matching, Flow-matching). This flexibility facilitates
comprehensive experimentation and has led to the discovery of novel policy
configurations that outperform existing methods on recent robot learning
benchmarks. Our experiments demonstrate not only significant performance gains
but also provide valuable insights into the strengths and weaknesses of various
design choices. This study serves as both a practical reference for
practitioners and a foundation for guiding future research in imitation
♻ ☆ Path Planning for Spot Spraying with UAVs Combining TSP and Area Coverages
This paper addresses the following task: given a set of patches or areas of
varying sizes that are meant to be serviced within a bounding contour, calculate
a minimal-length path plan for an unmanned aerial vehicle (UAV) such that the
path additionally avoids given obstacle areas and never leaves the
bounding contour. The application in mind is agricultural spot spraying, where
the bounding contour represents the field contour and multiple patches
represent multiple weed areas meant to be sprayed. Obstacle areas are ponds or
tree islands. The proposed method combines a heuristic solution to a traveling
salesman problem (TSP) with optimised area coverage path planning. Two
TSP-initialisation and four TSP-refinement heuristics as well as two area coverage
path planning methods are evaluated on three real-world experiments with three
obstacle areas and 15, 19 and 197 patches, respectively. The unsuitability of a
Boustrophedon-path for area coverage gap avoidance is discussed and inclusion
of a headland path for area coverage is motivated. Two main findings are (i)
the particular suitability of one TSP-refinement heuristic, and (ii) the
unexpectedly high contribution of patch area coverage path lengths to total
path length, highlighting the importance of optimised area coverage path
planning for spot spraying.
comment: 11 pages, 14 figures, 4 tables
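The abstract does not name its TSP-initialisation or TSP-refinement heuristics, so the following Python sketch swaps in two common textbook choices purely for illustration: nearest-neighbour initialisation and 2-opt refinement. It treats each patch as a single hypothetical centre point (`patch_centres`) and ignores obstacle areas, the bounding contour, and area coverage inside patches.

```python
import math


def tour_length(points, order):
    """Length of the closed tour visiting points in the given order."""
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
        for i in range(len(order))
    )


def nearest_neighbour(points, start=0):
    """Greedy initialisation: always fly to the closest unvisited patch."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = points[order[-1]]
        nxt = min(unvisited, key=lambda j: math.dist(last, points[j]))
        unvisited.remove(nxt)
        order.append(nxt)
    return order


def two_opt(points, order):
    """Refinement: reverse segments for as long as that shortens the tour."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                # Small tolerance guards against endless floating-point ties.
                if tour_length(points, candidate) < tour_length(points, best) - 1e-12:
                    best = candidate
                    improved = True
    return best


# Hypothetical patch centres in field coordinates (metres).
patch_centres = [(0, 0), (5, 1), (1, 4), (6, 5), (2, 2), (4, 3)]
initial = nearest_neighbour(patch_centres)
refined = two_opt(patch_centres, initial)
```

The refined tour visits every patch exactly once and is never longer than the greedy one. In the setting described above, the path flown inside each patch would still have to be added on top of this inter-patch tour.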
♻ ☆ Reinforcement Learning of Multi-robot Task Allocation for Multi-object Transportation with Infeasible Tasks
Multi-object transport using multi-robot systems has the potential for
diverse practical applications such as delivery services owing to its efficient
individual and scalable cooperative transport. However, allocating
transportation tasks of objects with unknown weights remains challenging.
Moreover, the presence of infeasible tasks (untransportable objects) can lead
to robot stoppage (deadlock). This paper proposes a framework for dynamic task
allocation that involves storing task experiences for each task in a scalable
manner with respect to the number of robots. First, these experiences are
broadcast from the cloud server to the entire robot system. Subsequently,
each robot learns the exclusion levels for each task based on those task
experiences, enabling it to exclude infeasible tasks and reset its task
priorities. Finally, individual transportation, cooperative transportation, and
the temporary exclusion of tasks considered infeasible are achieved. The
scalability and versatility of the proposed method were confirmed through
numerical experiments with an increased number of robots and objects, including
unlearned weight objects. The effectiveness of the temporary deadlock avoidance
was also confirmed by introducing additional robots within an episode. The
proposed method enables the implementation of task allocation strategies that
are feasible for different numbers of robots and various transport tasks
without prior consideration of feasibility.
comment: 8 pages, 10 figures
♻ ☆ BFA: Best-Feature-Aware Fusion for Multi-View Fine-grained Manipulation
In real-world scenarios, multi-view cameras are typically employed for
fine-grained manipulation tasks. Existing approaches (e.g., ACT) tend to treat
multi-view features equally and directly concatenate them for policy learning.
However, this introduces redundant visual information and higher
computational costs, leading to ineffective manipulation. A fine-grained
manipulation task tends to involve multiple stages, and the view that
contributes most varies across these stages over time. In this paper, we
propose a plug-and-play best-feature-aware (BFA) fusion strategy for multi-view
manipulation tasks, which is adaptable to various policies. Built upon the
visual backbone of the policy network, we design a lightweight network to
predict the importance score of each view. Based on the predicted importance
scores, the reweighted multi-view features are subsequently fused and input
into the end-to-end policy network, enabling seamless integration. Notably, our
method demonstrates outstanding performance in fine-grained manipulations.
Experimental results show that our approach outperforms multiple baselines by
22-46% success rate on different tasks. Our work provides new insights and
inspiration for tackling key challenges in fine-grained manipulations.
comment: 8 pages, 4 figures
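The reweighting idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the importance scores are normalised or how the reweighted features are fused, so the softmax normalisation and the weighted sum below are assumptions. The names `softmax` and `fuse_views` are hypothetical, and the scores are passed in by hand instead of being predicted by a lightweight network.

```python
import math


def softmax(scores):
    """Normalise raw importance scores into weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, scores):
    """Weighted sum of per-view feature vectors (one vector per camera)."""
    weights = softmax(scores)
    dim = len(view_features[0])
    return [
        sum(w * feat[d] for w, feat in zip(weights, view_features))
        for d in range(dim)
    ]


# Three hypothetical camera views with 2-D features; the first view is
# scored as most important, so it dominates the fused vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_views(features, scores=[2.0, 0.0, 0.0])
```

In this sketch the fused vector has the same size as a single view's features, in contrast to the direct concatenation of all views that the abstract attributes to existing approaches.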
♻ ☆ CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision
Teaching robots desired skills in real-world environments remains
challenging, especially for non-experts. The reliance on specialized expertise
in robot control and teleoperation systems often limits accessibility to
non-experts. We posit that natural language offers an intuitive and accessible
interface for robot learning. To this end, we study two aspects: (1) enabling
non-experts to collect robotic data through natural language supervision (e.g.,
"move the arm to the right") and (2) learning robotic policies directly from
this supervision. Specifically, we introduce a data collection framework that
collects robot demonstrations based on natural language supervision and further
augments these demonstrations. We then present CLIP-RT, a
vision-language-action (VLA) model that learns language-conditioned visuomotor
policies from this supervision. CLIP-RT adapts the pretrained CLIP models and
learns to predict language-based motion primitives via contrastive imitation
learning. We train CLIP-RT on the Open X-Embodiment dataset and finetune it on
in-domain data collected by our framework to learn diverse skills. CLIP-RT
demonstrates strong capabilities in learning novel manipulation skills,
outperforming the state-of-the-art model, OpenVLA (7B parameters), by 24% in
average success rates, while using 7x fewer parameters (1B). We further observe
that CLIP-RT shows significant improvements in few-shot generalization.
Finally, through collaboration with humans or large pretrained models, we
demonstrate that CLIP-RT can further improve its generalization on challenging
comment: 27 pages
♻ ☆ Functional Eigen-Grasping Using Approach Heatmaps
This work presents a framework for a robot with a multi-fingered hand to
freely utilize daily tools, including functional parts like buttons and
triggers. An approach heatmap is generated by selecting a functional finger,
indicating optimal palm positions on the object's surface that enable the
functional finger to contact the tool's functional part. Once the palm position
is identified through the heatmap, achieving the functional grasp becomes a
straightforward process where the fingers stably grasp the object with
low-dimensional inputs using the eigengrasp. As our approach does not need
human demonstrations, it can easily adapt to various sizes and designs,
extending its applicability to different objects. In our approach, we use
directional manipulability to obtain the approach heatmap. In addition, we add
two kinds of energy functions, i.e., palm energy and functional energy
functions, to realize the eigengrasp. Using this method, each robotic gripper
can autonomously identify its optimal workspace for functional grasping,
extending its applicability to non-anthropomorphic robotic hands. We show that
several daily tools like spray, drill, and remotes can be efficiently used by
not only an anthropomorphic Shadow hand but also a non-anthropomorphic Barrett
comment: 8 pages, 7 figures
♻ ☆ Generalizable Humanoid Manipulation with 3D Diffusion Policies
Humanoid robots capable of autonomous operation in diverse environments have
long been a goal for roboticists. However, autonomous manipulation by humanoid
robots has largely been restricted to one specific scene, primarily due to the
difficulty of acquiring generalizable skills and the high cost of
in-the-wild humanoid robot data. In this work, we build a real-world robotic
system to address this challenging problem. Our system is mainly an integration
of 1) a whole-upper-body robotic teleoperation system to acquire human-like
robot data, 2) a 25-DoF humanoid robot platform with a height-adjustable cart
and a 3D LiDAR sensor, and 3) an improved 3D Diffusion Policy learning
algorithm for humanoid robots to learn from noisy human data. We run more than
2000 episodes of policy rollouts on the real robot for rigorous policy
evaluation. Empowered by this system, we show that using only data collected in
one single scene and with only onboard computing, a full-sized humanoid robot
can autonomously perform skills in diverse real-world scenarios. Videos are
available at
comment: Project website:
♻ ☆ A Space-Efficient Algebraic Approach to Robotic Motion Planning
Matthias Bentert, Daniel Coimbra Salomao, Alex Crane, Yosuke Mizutani, Felix Reidl, Blair D. Sullivan
We consider efficient route planning for robots in applications such as
infrastructure inspection and automated surgical imaging. These tasks can be
modeled via the combinatorial problem Graph Inspection. The best known
algorithms for this problem are limited in practice by exponential space
complexity. In this paper, we develop a memory-efficient approach using
algebraic tools related to monomial testing on the polynomials associated with
certain arithmetic circuits. Our contributions are two-fold. We first repair a
minor flaw in existing work on monomial detection using a new approach we call
tree certificates. We further show that, in addition to detection, these tools
allow us to efficiently recover monomials of interest from circuits, opening
the door to significantly broader application of related algebraic tools.
For Graph Inspection, we design and evaluate a complete algebraic pipeline. Our
engineered implementation demonstrates that circuit-based algorithms are indeed
memory-efficient in practice, thus encouraging further engineering efforts.
♻ ☆ LEGATO: Cross-Embodiment Imitation Using a Grasping Tool
Cross-embodiment imitation learning enables policies trained on specific
embodiments to transfer across different robots, unlocking the potential for
large-scale imitation learning that is both cost-effective and highly reusable.
This paper presents LEGATO, a cross-embodiment imitation learning framework for
visuomotor skill transfer across varied kinematic morphologies. We introduce a
handheld gripper that unifies action and observation spaces, allowing tasks to
be defined consistently across robots. We train visuomotor policies on task
demonstrations using this gripper through imitation learning, applying a
transformation to a motion-invariant space for computing the training loss.
Gripper motions generated by the policies are retargeted into
high-degree-of-freedom whole-body motions using inverse kinematics for
deployment across diverse embodiments. Our evaluations in simulation and
real-robot experiments highlight the framework's effectiveness in learning and
transferring visuomotor skills across various robots. More information can be
found on the project page:
comment: Published in RA-L