Robotics 62
☆ DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control
Imitation learning has proven to be a powerful tool for training complex
visuomotor policies. However, current methods often require hundreds to
thousands of expert demonstrations to handle high-dimensional visual
observations. A key reason for this poor data efficiency is that visual
representations are predominantly either pretrained on out-of-domain data or
trained directly through a behavior cloning objective. In this work, we present
DynaMo, a new in-domain, self-supervised method for learning visual
representations. Given a set of expert demonstrations, we jointly learn a
latent inverse dynamics model and a forward dynamics model over a sequence of
image embeddings, predicting the next frame in latent space, without
augmentations, contrastive sampling, or access to ground truth actions.
Importantly, DynaMo does not require any out-of-domain data such as Internet
datasets or cross-embodied datasets. On a suite of six simulated and real
environments, we show that representations learned with DynaMo significantly
improve downstream imitation learning performance over prior self-supervised
learning objectives and pretrained representations. Gains from using DynaMo
hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP,
and nearest neighbors. Finally, we ablate over key components of DynaMo and
measure their impact on downstream policy performance. Robot videos are best
viewed at https://dynamo-ssl.github.io
☆ Bundle Adjustment in the Eager Mode
Bundle adjustment (BA) is a critical technique in various robotic
applications, such as simultaneous localization and mapping (SLAM), augmented
reality (AR), and photogrammetry. BA optimizes parameters such as camera poses
and 3D landmarks to align them with observations. With the growing importance
of deep learning in perception systems, there is an increasing need to
integrate BA with deep learning frameworks for enhanced reliability and
performance. However, widely used C++-based BA frameworks, such as GTSAM,
g$^2$o, and Ceres, lack native integration with modern deep learning libraries
like PyTorch. This limitation affects their flexibility, adaptability, ease of
debugging, and overall implementation efficiency. To address this gap, we
introduce an eager-mode BA framework seamlessly integrated with PyPose,
providing PyTorch-compatible interfaces with high efficiency. Our approach
includes GPU-accelerated, differentiable, and sparse operations designed for
2nd-order optimization, Lie group and Lie algebra operations, and linear
solvers. Our eager-mode BA on GPU demonstrates substantial runtime efficiency,
achieving an average speedup of 18.5$\times$, 22$\times$, and 23$\times$
compared to GTSAM, g$^2$o, and Ceres, respectively.
☆ WeHelp: A Shared Autonomy System for Wheelchair Users
Wheelchair users form a large population, and most of them need help with
daily tasks. However, according to recent reports, their needs are not properly
met due to a lack of caregivers. Therefore, in this project, we develop WeHelp,
a shared autonomy system aimed at wheelchair users. A robot with a WeHelp
system has three modes: following mode, remote control mode, and teleoperation
mode. In the following mode, the robot follows
the wheelchair user automatically via visual tracking. The wheelchair user can
ask the robot to follow them from behind, by the left or by the right. When the
wheelchair user asks for help, the robot will recognize the command via speech
recognition, and then switch to the teleoperation mode or remote control mode.
In the teleoperation mode, the wheelchair user takes over the robot with a
joystick and controls the robot to complete complex tasks for their needs,
such as opening doors, moving obstacles out of the way, or reaching objects on
a high shelf or on the ground. In the remote control mode, a remote assistant
takes over the robot and helps the wheelchair user complete some complex tasks
for their needs. Our evaluation shows that the pipeline is useful and practical
for wheelchair users. Source code and demo of the paper are available at
https://github.com/Walleclipse/WeHelp.
☆ Robots that Learn to Safely Influence via Prediction-Informed Reach-Avoid Dynamic Games
Robots can influence people to accomplish their tasks more efficiently:
autonomous cars can inch forward at an intersection to pass through, and
tabletop manipulators can go for an object on the table first. However, a
robot's ability to influence can also compromise the safety of nearby people if
naively executed. In this work, we pose and solve a novel robust reach-avoid
dynamic game which enables robots to be maximally influential, but only when a
safety backup control exists. On the human side, we model the human's behavior
as goal-driven but conditioned on the robot's plan, enabling us to capture
influence. On the robot side, we solve the dynamic game in the joint physical
and belief space, enabling the robot to reason about how its uncertainty in
human behavior will evolve over time. We instantiate our method, called SLIDE
(Safely Leveraging Influence in Dynamic Environments), in a high-dimensional
(39-D) simulated human-robot collaborative manipulation task solved via offline
game-theoretic reinforcement learning. We compare our approach to a robust
baseline that treats the human as a worst-case adversary, a safety controller
that does not explicitly reason about influence, and an energy-function-based
safety shield. We find that SLIDE consistently enables the robot to leverage
the influence it has on the human when it is safe to do so, ultimately allowing
the robot to be less conservative while still ensuring a high safety rate
during task execution.
☆ Residual Descent Differential Dynamic Game (RD3G) -- A Fast Newton Solver for Constrained General Sum Games
We present Residual Descent Differential Dynamic Game (RD3G), a Newton-based
solver for constrained multi-agent game-control problems. The proposed solver
seeks a local Nash equilibrium for problems where agents are coupled through
their rewards and state constraints. We compare the proposed method against
competing state-of-the-art techniques and showcase the computational benefits
of the RD3G algorithm on several example problems.
☆ Bi-objective trail-planning for a robot team orienteering in a hazardous environment
Teams of mobile [aerial, ground, or aquatic] robots have applications in
resource delivery, patrolling, information-gathering, agriculture, forest fire
fighting, chemical plume source localization and mapping, and
search-and-rescue. Robot teams traversing hazardous environments -- with e.g.
rough terrain or seas, strong winds, or adversaries capable of attacking or
capturing robots -- should plan and coordinate their trails in consideration of
risks of disablement, destruction, or capture. Specifically, the robots should
take the safest trails, coordinate their trails to cooperatively achieve the
team-level objective with robustness to robot failures, and balance the reward
from visiting locations against risks of robot losses. Herein, we consider
bi-objective trail-planning for a mobile team of robots orienteering in a
hazardous environment. The hazardous environment is abstracted as a directed
graph whose arcs, when traversed by a robot, present known probabilities of
survival. Each node of the graph offers a reward to the team if visited by a
robot (which e.g. delivers a good to or images the node). We wish to search for
the Pareto-optimal robot-team trail plans that maximize two [conflicting] team
objectives: the expected (i) team reward and (ii) number of robots that survive
the mission. A human decision-maker can then select trail plans that balance,
according to their values, reward and robot survival. We implement ant colony
optimization, guided by heuristics, to search for the Pareto-optimal set of
robot team trail plans. As a case study, we illustrate with an
information-gathering mission in an art museum.
comment: v0.0
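The two objectives described above can be made concrete with a minimal sketch. It assumes (beyond what the abstract states) that robots fail independently, that a node pays its reward once if at least one robot reaches it, and that each trail is a list of node names. The function names, the toy graph, and the plan format are hypothetical, and the ant colony search itself is not shown.

```python
from math import prod

def visit_probabilities(trail, survival):
    """Return ({node: P(robot reaches node)}, P(robot survives the whole trail))."""
    alive = 1.0
    visit = {trail[0]: 1.0}
    for u, v in zip(trail, trail[1:]):
        alive *= survival[(u, v)]   # the robot must survive every arc so far
        visit.setdefault(v, alive)  # the first arrival has the highest probability
    return visit, alive

def team_objectives(trails, survival, reward):
    """Expected team reward and expected number of surviving robots."""
    per_robot = [visit_probabilities(t, survival) for t in trails]
    expected_survivors = sum(alive for _, alive in per_robot)
    expected_reward = 0.0
    for node, r in reward.items():
        # Assumption: the node pays out once if at least one robot reaches it.
        p_nobody = prod(1.0 - visit.get(node, 0.0) for visit, _ in per_robot)
        expected_reward += r * (1.0 - p_nobody)
    return expected_reward, expected_survivors

def pareto_front(plans):
    """Keep plans whose (reward, survivors) pair is not dominated by another plan."""
    def dominates(a, b):
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return [p for p in plans
            if not any(dominates(q["obj"], p["obj"]) for q in plans)]

# Hypothetical toy graph: arc -> survival probability, node -> reward.
survival = {("base", "a"): 0.9, ("a", "base"): 0.8,
            ("base", "b"): 0.5, ("b", "base"): 0.5}
reward = {"a": 1.0, "b": 4.0}

candidates = [[["base", "a", "base"]], [["base", "b", "base"]], [["base"]]]
plans = [{"trails": t, "obj": team_objectives(t, survival, reward)}
         for t in candidates]
front = pareto_front(plans)
```

In this toy case no candidate dominates another, so all three stay on the front: visiting the risky, high-reward node trades survival for reward, which is the conflict a human decision-maker would then resolve.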
☆ An Efficient Projection-Based Next-best-view Planning Framework for Reconstruction of Unknown Objects
Efficiently and completely capturing the three-dimensional data of an object
is a fundamental problem in industrial and robotic applications. The task of
next-best-view (NBV) planning is to infer the pose of the next viewpoint based
on the current data, and gradually realize the complete three-dimensional
reconstruction. Many existing algorithms, however, suffer a large computational
burden due to the use of ray-casting. To address this, this paper proposes a
projection-based NBV planning framework. It can select the next best view at an
extremely fast speed while ensuring the complete scanning of the object.
Specifically, this framework refits different types of voxel clusters into
ellipsoids based on the voxel structure. Then, the next best view is selected
from the candidate views using a projection-based viewpoint quality evaluation
function in conjunction with a global partitioning strategy. This process
replaces the ray-casting in voxel structures, significantly improving the
computational efficiency. Comparative experiments with other algorithms in a
simulation environment show that the proposed framework achieves a tenfold
efficiency improvement while capturing roughly the same coverage. Real-world
experimental results also demonstrate the efficiency
and feasibility of the framework.
☆ A machine learning framework for acoustic reflector mapping
Sonar-based indoor mapping systems have been widely employed in robotics for
several decades. While such systems are still mainstream in underwater and
pipe inspection settings, their vulnerability to noise has, over time, reduced
their widespread use in favour of other modalities (e.g., cameras, lidars),
whose technologies were, meanwhile, undergoing extraordinary
advancements. Nevertheless, mapping physical environments using acoustic
signals and echolocation can bring significant benefits to robot navigation in
adverse scenarios, thanks to their complementary characteristics compared to
other sensors. Cameras and lidars, indeed, struggle in harsh weather
conditions, when dealing with lack of illumination, or with non-reflective
walls. Yet, for acoustic sensors to be able to generate accurate maps, noise
has to be properly and effectively handled. Traditional signal processing
techniques are not always a solution in those cases. In this paper, we propose
a framework where machine learning is exploited to aid more traditional signal
processing methods to cope with background noise, by removing outliers and
artefacts from the maps generated using acoustic sensors. Our goal is to
demonstrate that the performance of traditional echolocation mapping techniques
can be greatly enhanced, even in particularly noisy conditions, facilitating
the employment of acoustic sensors in state-of-the-art multi-modal robot
navigation systems. Our simulated evaluation demonstrates that the system can
reliably operate at an SNR of $-10$ dB. Moreover, we also show that the proposed
method is capable of operating in different reverberant environments. In this
paper, we also use the proposed method to map the outline of a simulated room
using a robotic platform.
☆ IMRL: Integrating Visual, Physical, Temporal, and Geometric Representations for Enhanced Food Acquisition
Robotic assistive feeding holds significant promise for improving the quality
of life for individuals with eating disabilities. However, acquiring diverse
food items under varying conditions and generalizing to unseen food presents
unique challenges. Existing methods that rely on surface-level geometric
information (e.g., bounding box and pose) derived from visual cues (e.g.,
color, shape, and texture) often lack adaptability and robustness, especially
when foods share similar physical properties but differ in visual appearance.
We employ imitation learning (IL) to learn a policy for food acquisition.
Existing methods employ IL or Reinforcement Learning (RL) to learn a policy
based on off-the-shelf image encoders such as ResNet-50. However, such
representations are not robust and struggle to generalize across diverse
acquisition scenarios. To address these limitations, we propose a novel
approach, IMRL (Integrated Multi-Dimensional Representation Learning), which
integrates visual, physical, temporal, and geometric representations to enhance
the robustness and generalizability of IL for food acquisition. Our approach
captures food types and physical properties (e.g., solid, semi-solid, granular,
liquid, and mixture), models temporal dynamics of acquisition actions, and
introduces geometric information to determine optimal scooping points and
assess bowl fullness. IMRL enables IL to adaptively adjust scooping strategies
based on context, improving the robot's capability to handle diverse food
acquisition scenarios. Experiments on a real robot demonstrate our approach's
robustness and adaptability across various foods and bowl configurations,
including zero-shot generalization to unseen settings. Our approach achieves
an improvement of up to $35\%$ in success rate compared with the best-performing
baseline.
☆ Online Refractive Camera Model Calibration in Visual Inertial Odometry IROS 2024
This paper presents a general refractive camera model and online
co-estimation of odometry and the refractive index of unknown media. This
enables operation in diverse and varying refractive fluids, given only the
camera calibration in air. The refractive index is estimated online as a state
variable of a monocular visual-inertial odometry framework in an iterative
formulation using the proposed camera model. The method was verified on data
collected using an underwater robot traversing inside a pool. The evaluations
demonstrate convergence to the ideal refractive index for water despite
significant perturbations in the initialization. Simultaneously, the approach
enables on-par visual-inertial odometry performance in refractive media without
prior knowledge of the refractive index or requirement of medium-specific
camera calibration.
comment: Accepted at the 2024 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2024), 8 pages
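As background for why the refractive index matters to the camera model, the following sketch applies the vector form of Snell's law to a single ray crossing a flat interface. It is not the paper's camera model: the flat, infinitely thin interface, the function name `refract`, and the index 1.333 for water are assumptions made for illustration.

```python
import math

def refract(d, n, n1, n2):
    """Refract unit direction d at an interface with unit normal n.

    n points back toward the medium the ray comes from (against d);
    n1 and n2 are the refractive indices before and after the interface.
    Returns the unit refracted direction, or None on total internal reflection.
    """
    eta = n1 / n2
    cos_i = -sum(di * ni for di, ni in zip(d, n))
    sin2_t = eta * eta * (1.0 - cos_i * cos_i)  # Snell: sin_t = eta * sin_i
    if sin2_t > 1.0:
        return None
    cos_t = math.sqrt(1.0 - sin2_t)
    return tuple(eta * di + (eta * cos_i - cos_t) * ni for di, ni in zip(d, n))

# A camera ray leaving air (index 1.0) at 30 degrees from the optical axis
# and entering water (assumed index 1.333) through a port facing along +z.
ray_air = (math.sin(math.radians(30.0)), 0.0, math.cos(math.radians(30.0)))
normal = (0.0, 0.0, -1.0)
ray_water = refract(ray_air, normal, 1.0, 1.333)
```

The refracted ray bends toward the optical axis, so a calibration obtained in air misplaces image points under water unless the index is known or, as in the abstract, estimated as part of the state.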
☆ Generalized Robot Learning Framework
Imitation-based robot learning has recently gained significant attention in
the robotics field due to its theoretical potential for transferability and
generalizability. However, it remains notoriously costly, both in terms of
hardware and data collection, and deploying it in real-world environments
demands meticulous setup of robots and precise experimental conditions. In this
paper, we present a low-cost robot learning framework that is both easily
reproducible and transferable to various robots and environments. We
demonstrate that deployable imitation learning can be successfully applied even
to industrial-grade robots, not just expensive collaborative robotic arms.
Furthermore, our results show that multi-task robot learning is achievable with
simple network architectures and fewer demonstrations than previously thought
necessary. Because current evaluation methods are largely subjective when it
comes to real-world manipulation tasks, we propose Voting Positive Rate (VPR), a
novel evaluation strategy that provides a more objective assessment of
performance. We conduct an extensive comparison of success rates across various
self-designed tasks to validate our approach. To foster collaboration and
support the robot learning community, we have open-sourced all relevant
datasets and model checkpoints, available at huggingface.co/ZhiChengAI.
comment: 6 pages, 2 figures. cs.RO
☆ Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping
We propose visual-inertial simultaneous localization and mapping that tightly
couples sparse reprojection errors, inertial measurement unit pre-integrals,
and relative pose factors with dense volumetric occupancy mapping. Depth
predictions from a deep neural network are thereby fused in a fully probabilistic
manner. Specifically, our method is rigorously uncertainty-aware: first, we use
depth and uncertainty predictions from a deep network not only from the robot's
stereo rig, but we further probabilistically fuse motion stereo that provides
depth information across a range of baselines, therefore drastically increasing
mapping accuracy. Next, predicted and fused depth uncertainty propagates not
only into occupancy probabilities but also into alignment factors between
generated dense submaps that enter the probabilistic nonlinear least squares
estimator. This submap representation offers globally consistent geometry at
scale. Our method is thoroughly evaluated on two benchmark datasets, resulting
in localization and mapping accuracy that exceeds the state of the art, while
simultaneously offering volumetric occupancy directly usable for downstream
robotic planning and control in real-time.
comment: 7 pages, 4 figures, 5 tables, conference
☆ Handling Long-Term Safety and Uncertainty in Safe Reinforcement Learning
Safety is one of the key issues preventing the deployment of reinforcement
learning techniques in real-world robots. While most approaches in the Safe
Reinforcement Learning area do not require prior knowledge of constraints and
robot kinematics and rely solely on data, it is often difficult to deploy them
in complex real-world settings. Instead, model-based approaches that
incorporate prior knowledge of the constraints and dynamics into the learning
framework have proven capable of deploying the learning algorithm directly on
the real robot. Unfortunately, while an approximated model of the robot
dynamics is often available, the safety constraints are task-specific and hard
to obtain: they may be too complicated to encode analytically, too expensive to
compute, or it may be difficult to envision a priori the long-term safety
requirements. In this paper, we bridge this gap by extending the safe
exploration method, ATACOM, with learnable constraints, with a particular focus
on ensuring long-term safety and handling of uncertainty. Our approach is
competitive or superior to state-of-the-art methods in final performance while
maintaining safer behavior during training.
☆ Panoptic-Depth Forecasting
Forecasting the semantics and 3D structure of scenes is essential for robots
to navigate and plan actions safely. Recent methods have explored semantic and
panoptic scene forecasting; however, they do not consider the geometry of the
scene. In this work, we propose the panoptic-depth forecasting task for jointly
predicting the panoptic segmentation and depth maps of unobserved future
frames, from monocular camera images. To facilitate this work, we extend the
popular KITTI-360 and Cityscapes benchmarks by computing depth maps from LiDAR
point clouds and leveraging sequential labeled data. We also introduce a
suitable evaluation metric that quantifies both the panoptic quality and depth
estimation accuracy of forecasts in a coherent manner. Furthermore, we present
two baselines and propose the novel PDcast architecture that learns rich
spatio-temporal representations by incorporating a transformer-based encoder, a
forecasting module, and task-specific decoders to predict future panoptic-depth
outputs. Extensive evaluations demonstrate the effectiveness of PDcast across
two datasets and three forecasting tasks, consistently addressing the primary
challenges. We make the code publicly available at
https://pdcast.cs.uni-freiburg.de.
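The step of computing depth maps from LiDAR point clouds can be sketched with a plain pinhole projection. This is only an illustration under stated assumptions: points are already expressed in the camera frame (x right, y down, z forward), lens distortion is ignored, and the function name and parameters are hypothetical rather than taken from the benchmark tooling.

```python
def depth_map_from_points(points_cam, fx, fy, cx, cy, width, height):
    """Project 3D points (camera frame, z forward, metres) into a sparse depth map.

    Pixels without a LiDAR return stay None; when several points land on the
    same pixel, the nearest one wins.
    """
    depth = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # point is behind the camera
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth

# Hypothetical 5x5 image with made-up intrinsics and four points.
points = [(0.0, 0.0, 10.0), (0.0, 0.0, 5.0), (0.1, 0.0, 10.0), (0.0, 0.0, -1.0)]
sparse_depth = depth_map_from_points(points, fx=100.0, fy=100.0,
                                     cx=2.0, cy=2.0, width=5, height=5)
```

The result is sparse because LiDAR returns cover only a fraction of the pixels, which is why depth metrics on such ground truth are usually evaluated only where a value exists.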
☆ Real-Time-Feasible Collision-Free Motion Planning For Ellipsoidal Objects
Online planning of collision-free trajectories is a fundamental task for
robotics and self-driving car applications. This paper revisits collision
avoidance between ellipsoidal objects using differentiable constraints. Two
ellipsoids do not overlap if and only if the endpoint of the vector between the
center points of the ellipsoids does not lie in the interior of the Minkowski
sum of the ellipsoids. This condition is formulated using a parametric
over-approximation of the Minkowski sum, which can be made tight in any given
direction. The resulting collision avoidance constraint is included in an
optimal control problem (OCP) and evaluated in comparison to the
separating-hyperplane approach. Not only do we observe that the Minkowski-sum
formulation is computationally more efficient in our experiments, but also that
using pre-determined over-approximation parameters based on warm-start
trajectories leads to a very limited increase in suboptimality. This gives rise
to a novel real-time scheme for collision-free motion planning with model
predictive control (MPC). Both the real-time feasibility and the effectiveness
of the constraint formulation are demonstrated in challenging real-world
experiments.
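The Minkowski-sum condition can be illustrated in the plane with a standard parametric outer ellipse. The sketch assumes ellipses of the form {x : (x - c)^T Q^{-1} (x - c) <= 1} and uses the classical bound that the sum of E(0, Q1) and E(0, Q2) lies inside E(0, (1 + 1/beta) Q1 + (1 + beta) Q2) for any beta > 0, with equality of support in direction l when beta = sqrt(l^T Q1 l / l^T Q2 l). Whether the paper uses this exact parameterization is an assumption, and all function names are hypothetical.

```python
import math

def quad(Q, l):
    """l^T Q l for a symmetric 2x2 matrix Q."""
    return Q[0][0] * l[0] * l[0] + 2.0 * Q[0][1] * l[0] * l[1] + Q[1][1] * l[1] * l[1]

def outer_shape(Q1, Q2, beta):
    """Shape matrix of an ellipse that contains E(0, Q1) + E(0, Q2)."""
    a, b = 1.0 + 1.0 / beta, 1.0 + beta
    return [[a * Q1[i][j] + b * Q2[i][j] for j in range(2)] for i in range(2)]

def tight_beta(Q1, Q2, l):
    """Parameter for which the outer ellipse touches the true sum in direction l."""
    return math.sqrt(quad(Q1, l) / quad(Q2, l))

def certified_separate(c1, Q1, c2, Q2, beta):
    """True if the centre offset lies outside the outer ellipse.

    This is sufficient (not necessary) for the two ellipses not to overlap.
    """
    Q = outer_shape(Q1, Q2, beta)
    det = Q[0][0] * Q[1][1] - Q[0][1] * Q[1][0]
    Q_inv = [[Q[1][1] / det, -Q[0][1] / det],
             [-Q[1][0] / det, Q[0][0] / det]]
    d = (c2[0] - c1[0], c2[1] - c1[1])
    return quad(Q_inv, d) > 1.0

# Ellipse with semi-axes (2, 1) at the origin and a unit circle 3.1 to its right.
Q1 = [[4.0, 0.0], [0.0, 1.0]]
Q2 = [[1.0, 0.0], [0.0, 1.0]]
beta = tight_beta(Q1, Q2, (1.0, 0.0))
separated_tight = certified_separate((0.0, 0.0), Q1, (3.1, 0.0), Q2, beta)
separated_loose = certified_separate((0.0, 0.0), Q1, (3.1, 0.0), Q2, 1.0)
```

With the parameter chosen for the x direction the 0.1 gap is certified, while a generic beta = 1 is too conservative to certify it. This is why fixing the parameters from a warm-start trajectory, as the abstract describes, can keep the added suboptimality small.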
☆ Representing Positional Information in Generative World Models for Object Manipulation
Object manipulation capabilities are essential skills that set apart embodied
agents engaging with the world, especially in the realm of robotics. The
ability to predict outcomes of interactions with objects is paramount in this
setting. While model-based control methods have started to be employed for
tackling manipulation tasks, they have faced challenges in accurately
manipulating objects. Analyzing the causes of this limitation, we trace the
underperformance to the way current world models represent crucial
positional information, especially about the target's goal specification for
object positioning tasks. We introduce a general approach that empowers world
model-based agents to effectively solve object-positioning tasks. We propose
two variants of this approach for generative world models:
position-conditioned (PCP) and latent-conditioned (LCP) policy learning. In
particular, LCP employs object-centric latent representations that explicitly
capture object positional information for goal specification. This naturally
leads to the emergence of multimodal capabilities, enabling the specification
of goals through spatial coordinates or a visual goal. Our methods are
rigorously evaluated across several manipulation environments, showing
favorable performance compared to current model-based control approaches.
☆ Towards Global Localization using Multi-Modal Object-Instance Re-Identification ICRA 2025
Aneesh Chavan, Vaibhav Agrawal, Vineeth Bhat, Sarthak Chittawar, Siddharth Srivastava, Chetan Arora, K Madhava Krishna
Re-identification (ReID) is a critical challenge in computer vision,
predominantly studied in the context of pedestrians and vehicles. However,
robust object-instance ReID, which has significant implications for tasks such
as autonomous exploration, long-term perception, and scene understanding,
remains underexplored. In this work, we address this gap by proposing a novel
dual-path object-instance re-identification transformer architecture that
integrates multimodal RGB and depth information. By leveraging depth data, we
demonstrate improvements in ReID across scenes that are cluttered or have
varying illumination conditions. Additionally, we develop a ReID-based
localization framework that enables accurate camera localization and pose
identification across different viewpoints. We validate our methods using two
custom-built RGB-D datasets, as well as multiple sequences from the open-source
TUM RGB-D datasets. Our approach demonstrates significant improvements in both
object instance ReID (mAP of 75.18) and localization accuracy (success rate of
83% on TUM-RGBD), highlighting the essential role of object ReID in advancing
robotic perception. Our models, frameworks, and datasets have been made
publicly available.
comment: 8 pages, 5 figures, 3 tables. Submitted to ICRA 2025
☆ LMMCoDrive: Cooperative Driving with Large Multimodal Model
To address the intricate challenges of decentralized cooperative scheduling
and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper
introduces LMMCoDrive, a novel cooperative driving framework that leverages a
Large Multimodal Model (LMM) to enhance traffic efficiency in dynamic urban
environments. This framework seamlessly integrates scheduling and motion
planning processes to ensure the effective operation of Cooperative Autonomous
Vehicles (CAVs). The spatial relationship between CAVs and passenger requests
is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of
the LMM. In addition, trajectories are cautiously refined for each CAV while
ensuring collision avoidance through safety constraints. A decentralized
optimization strategy, facilitated by the Alternating Direction Method of
Multipliers (ADMM) within the LMM framework, is proposed to drive the graph
evolution of CAVs. Simulation results demonstrate the pivotal role and
significant impact of LMM in optimizing CAV scheduling and enhancing the
decentralized cooperative optimization process for each vehicle. This marks a
substantial stride towards achieving practical, efficient, and safe AMoD
systems that are poised to revolutionize urban transportation. The code is
available at https://github.com/henryhcliu/LMMCoDrive.
comment: 7 pages, 5 figures
☆ Particle-based Instance-aware Semantic Occupancy Mapping in Dynamic Environments
Representing the 3D environment with instance-aware semantic and geometric
information is crucial for interaction-aware robots in dynamic environments.
Nonetheless, creating such a representation poses challenges due to sensor
noise, instance segmentation and tracking errors, and the objects' dynamic
motion. This paper introduces a novel particle-based instance-aware semantic
occupancy map to tackle these challenges. Particles with an augmented instance
state are used to estimate the Probability Hypothesis Density (PHD) of the
objects and implicitly model the environment. Utilizing a State-augmented
Sequential Monte Carlo PHD (S$^2$MC-PHD) filter, these particles are updated to
jointly estimate occupancy status, semantics, and instance IDs, mitigating
noise. Additionally, a memory module is adopted to enhance the map's
responsiveness to previously observed objects. Experimental results on the
Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses
state-of-the-art methods across multiple metrics under different noise
conditions. Subsequent tests using real-world data further validate the
effectiveness of the proposed approach.
☆ Metric-Semantic Factor Graph Generation based on Graph Neural Networks ICRA 2025
Understanding the relationships between geometric structures and semantic
concepts is crucial for building accurate models of complex environments.
Indoors, certain spatial constraints, such as the relative positioning of
planes, remain consistent despite variations in layout. This paper explores how
these invariant relationships can be captured in a graph SLAM framework by
representing high-level concepts like rooms and walls, linking them to
geometric elements like planes through an optimizable factor graph. Several
efforts have tackled this issue with ad hoc solutions for each concept
generation and with manually defined factors.
This paper proposes a novel method for metric-semantic factor graph
generation which includes defining a semantic scene graph, integrating
geometric information, and learning the interconnecting factors, all based on
Graph Neural Networks (GNNs). An edge classification network (G-GNN) sorts the
edges between planes into same-room, same-wall, or none types. The resulting
relations are clustered, generating a room or wall for each cluster. A second
family of networks (F-GNN) infers the geometrical origin of the new nodes. The
definition of the factors employs the same F-GNN used for the metric attribute
of the generated nodes. Furthermore, we share the new factor graph with the
S-Graphs+ algorithm, extending its graph expressiveness and scene
representation with the ultimate goal of improving the SLAM performance. The
complexity of the environments is increased to N-plane rooms by training the
networks on L-shaped rooms. The framework is evaluated in synthetic and
simulated scenarios as no real datasets of the required complex layouts are
available.
comment: Submitted to ICRA 2025
☆ Reactive Collision Avoidance for Safe Agile Navigation
Reactive collision avoidance is essential for agile robots navigating complex
and dynamic environments, enabling real-time obstacle response. However, this
task is inherently challenging because it requires a tight integration of
perception, planning, and control, which traditional methods often handle
separately, resulting in compounded errors and delays. This paper introduces a
novel approach that unifies these tasks into a single reactive framework using
solely onboard sensing and computing. Our method combines nonlinear model
predictive control with adaptive control barrier functions, directly linking
perception-driven constraints to real-time planning and control. Constraints
are determined by using a neural network to refine noisy RGB-D data, enhancing
depth accuracy, and selecting points with the minimum time-to-collision to
prioritize the most immediate threats. To maintain a balance between safety and
agility, a heuristic dynamically adjusts the optimization process, preventing
overconstraints in real time. Extensive experiments with an agile quadrotor
demonstrate effective collision avoidance across diverse indoor and outdoor
environments, without requiring environment-specific tuning or explicit
mapping.
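Selecting points by minimum time-to-collision can be sketched with a simple range over closing-speed estimate. The sketch assumes static obstacle points given relative to the robot and a robot velocity expressed in the same frame. The paper's actual time-to-collision definition and point-selection rule may differ, and the function names are hypothetical.

```python
import math

def time_to_collision(point, velocity):
    """Time until the robot reaches `point` if it keeps moving with `velocity`.

    `point` is the obstacle position relative to the robot and `velocity` is
    the robot velocity in the same frame. Returns math.inf when the robot is
    not approaching the point.
    """
    dist = math.sqrt(sum(p * p for p in point))
    if dist == 0.0:
        return 0.0
    closing_speed = sum(p * v for p, v in zip(point, velocity)) / dist
    return dist / closing_speed if closing_speed > 0.0 else math.inf

def most_urgent(points, velocity, k):
    """Return the k points with the smallest time-to-collision."""
    return sorted(points, key=lambda p: time_to_collision(p, velocity))[:k]

# Hypothetical depth points (metres) and a robot flying along +x at 1 m/s.
points = [(4.0, 0.0, 0.0), (2.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
threats = most_urgent(points, (1.0, 0.0, 0.0), k=2)
```

Ranking by time rather than by distance means a nearby point behind or beside the robot is ignored in favour of a farther point directly in its path, which keeps the number of active constraints small.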
☆ Human-Robot Cooperative Piano Playing with Learning-Based Real-Time Music Accompaniment
Recent advances in machine learning have paved the way for the development of
musical and entertainment robots. However, human-robot cooperative instrument
playing remains a challenge, particularly due to the intricate motor
coordination and temporal synchronization. In this paper, we propose a
theoretical framework for human-robot cooperative piano playing based on
non-verbal cues. First, we present a music improvisation model that employs a
recurrent neural network (RNN) to predict appropriate chord progressions based
on the human's melodic input. Second, we propose a behavior-adaptive controller
to facilitate seamless temporal synchronization, allowing the cobot to generate
harmonious acoustics. The collaboration takes into account the bidirectional
information flow between the human and robot. We have developed an
entropy-based system to assess the quality of cooperation by analyzing the
impact of different communication modalities during human-robot collaboration.
Experiments demonstrate that our RNN-based improvisation can achieve a 93\%
accuracy rate. Meanwhile, with the MPC adaptive controller, the robot could
respond to the human teammate in homophony performances with real-time
accompaniment. Our designed framework has been validated to be effective in
allowing humans and robots to work collaboratively in the artistic
piano-playing task.
comment: 20 pages
☆ GauTOAO: Gaussian-based Task-Oriented Affordance of Objects
When your robot grasps an object using dexterous hands or grippers, it should
understand the Task-Oriented Affordances of the Object (TOAO), as different
tasks often require attention to specific parts of the object. To address this
challenge, we propose GauTOAO, a Gaussian-based framework for Task-Oriented
Affordance of Objects, which leverages vision-language models in a zero-shot
manner to predict affordance-relevant regions of an object, given a natural
language query. Our approach introduces a new paradigm: "static camera, moving
object," allowing the robot to better observe and understand the object in hand
during manipulation. GauTOAO addresses the limitations of existing methods,
which often lack effective spatial grouping, by extracting a comprehensive 3D
object mask using DINO features. This mask is then used to conditionally query
Gaussians, producing a refined semantic distribution over the object for the
specified task. This approach results in more accurate TOAO extraction,
enhancing the robot's understanding of the object and improving task
performance. We validate the effectiveness of GauTOAO through real-world
experiments, demonstrating its capability to generalize across various tasks.
comment: 6 pages
☆ Reinforcement Learning with Lie Group Orientations for Robotics ICRA 2025
Handling orientations of robots and objects is a crucial aspect of many
applications. Yet, all too often, there is a lack of mathematical correctness
when dealing with orientations, especially in learning pipelines involving, for
example, artificial neural networks. In this paper, we investigate
reinforcement learning with orientations and propose a simple modification of
the network's input and output that adheres to the Lie group structure of
orientations. As a result, we obtain an easy and efficient implementation that
is directly usable with existing learning libraries and achieves significantly
better performance than other common orientation representations. We briefly
introduce Lie theory specifically for orientations in robotics to motivate and
outline our approach. Subsequently, a thorough empirical evaluation of
different combinations of orientation representations for states and actions
demonstrates the superior performance of our proposed approach in different
scenarios, including: direct orientation control, end effector orientation
control, and pick-and-place tasks.
comment: Submitted to ICRA 2025
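The idea of making a network output respect the group structure of orientations can be illustrated with a minimal sketch. It assumes unit quaternions in (w, x, y, z) order and a policy that emits a raw 3-D vector read as a tangent-space rotation vector. The function names are hypothetical and this is not the authors' implementation, only one standard way to compose a tangent-space action onto the current orientation via the exponential map.

```python
import math

def quat_exp(w):
    """Map a rotation vector w in R^3 (tangent space) to a unit quaternion (w, x, y, z)."""
    theta = math.sqrt(sum(c * c for c in w))
    if theta < 1e-8:
        # First-order approximation near zero, renormalised to stay on the unit sphere.
        q = (1.0, 0.5 * w[0], 0.5 * w[1], 0.5 * w[2])
        n = math.sqrt(sum(c * c for c in q))
        return tuple(c / n for c in q)
    s = math.sin(0.5 * theta) / theta
    return (math.cos(0.5 * theta), s * w[0], s * w[1], s * w[2])

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def apply_action(q_current, action):
    """Treat a raw 3-D policy output as a tangent vector and compose it on the group."""
    return quat_mul(q_current, quat_exp(action))

# Two successive 90-degree turns about z give a 180-degree rotation about z.
identity = (1.0, 0.0, 0.0, 0.0)
q = apply_action(identity, (0.0, 0.0, math.pi / 2))
q = apply_action(q, (0.0, 0.0, math.pi / 2))
```

Because the update is a group composition, the result stays a unit quaternion without any ad hoc renormalisation of the network output.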
☆ Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR ICRA 2025
Robotic manipulation is essential for the widespread adoption of robots in
industrial and home settings and has long been a focus within the robotics
community. Advances in artificial intelligence have introduced promising
learning-based methods to address this challenge, with imitation learning
emerging as particularly effective. However, efficiently acquiring high-quality
demonstrations remains a challenge. In this work, we introduce an immersive
VR-based teleoperation setup designed to collect demonstrations from a remote
human user. We also propose an imitation learning framework called Haptic
Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we
conducted a pick-and-place task and collected 50 demonstration episodes.
Results indicate that the immersive VR platform significantly reduces
demonstrator fingertip forces compared to systems without haptic feedback,
enabling more delicate manipulation. Additionally, evaluations of the
Haptic-ACT framework in both the MuJoCo simulator and on a real robot
demonstrate its effectiveness in teaching robots more compliant manipulation
compared to the original ACT. Additional materials are available at
https://sites.google.com/view/hapticact.
comment: This work is under review by ICRA 2025
☆ Repeatable Energy-Efficient Perching for Flapping-Wing Robots Using Soft Grippers
With the emergence of new flapping-wing micro aerial vehicle (FWMAV) designs,
a need for extensive and advanced mission capabilities arises. FWMAVs try to
adapt and emulate the flight features of birds and flying insects. While
current designs already achieve high manoeuvrability, they still almost
entirely lack perching and take-off abilities. These capabilities could, for
instance, enable long-term monitoring and surveillance missions, and operations
in cluttered environments or in proximity to humans and animals. We present the
development and testing of a framework that enables repeatable perching and
take-off for small to medium-sized FWMAVs, utilising soft, non-damaging
grippers. Thanks to its novel active-passive actuation system, an
energy-conserving state can be achieved and indefinitely maintained while the
vehicle is perched. A prototype of the proposed system weighing under 39 g was
manufactured and extensively tested on a 110 g flapping-wing robot. Successful
free-flight tests demonstrated the full mission cycle of landing, perching and
subsequent take-off. The telemetry data recorded during the flights yields
extensive insight into the system's behaviour and is a valuable step towards
full automation and optimisation of the entire take-off and landing cycle.
comment: 17 pages, 13 figures, 5 multimedia extensions
☆ Fusion in Context: A Multimodal Approach to Affective State Recognition
Accurate recognition of human emotions is a crucial challenge in affective
computing and human-robot interaction (HRI). Emotional states play a vital role
in shaping behaviors, decisions, and social interactions. However, emotional
expressions can be influenced by contextual factors, leading to
misinterpretations if context is not considered. Multimodal fusion, combining
modalities like facial expressions, speech, and physiological signals, has
shown promise in improving affect recognition. This paper proposes a
transformer-based multimodal fusion approach that leverages facial thermal
data, facial action units, and textual context information for context-aware
emotion recognition. We explore modality-specific encoders to learn tailored
representations, which are then fused using additive fusion and processed by a
shared transformer encoder to capture temporal dependencies and interactions.
The proposed method is evaluated on a dataset collected from participants
engaged in a tangible tabletop Pacman game designed to induce various affective
states. Our results demonstrate the effectiveness of incorporating contextual
information and multimodal fusion for affective state recognition.
☆ AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots
Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li
This paper presents AlignBot, a novel framework designed to optimize
VLM-powered customized task planning for household robots by effectively
aligning with user reminders. In domestic settings, aligning task planning with
user reminders poses significant challenges due to the limited quantity,
diversity, and multimodal nature of the reminders. To address these challenges,
AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for
GPT-4o. This adapter model internalizes diverse forms of user reminders (such as
personalized preferences, corrective guidance, and contextual assistance) into
structured instruction-formatted cues that prompt GPT-4o in generating
customized task plans. Additionally, AlignBot integrates a dynamic retrieval
mechanism that selects task-relevant historical successes as prompts for
GPT-4o, further enhancing task planning accuracy. To validate the effectiveness
of AlignBot, experiments are conducted in real-world household environments,
which are constructed within the laboratory to replicate typical household
settings. A multimodal dataset with over 1,500 entries derived from volunteer
reminders is used for training and evaluation. The results demonstrate that
AlignBot significantly improves customized task planning, outperforming
existing LLM- and VLM-powered planners by interpreting and aligning with user
reminders, achieving an 86.8% success rate compared to the vanilla GPT-4o baseline
at 21.6%, an improvement of about 65 percentage points and roughly four times
the baseline success rate. Supplementary materials are available at:
https://yding25.com/AlignBot/
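The reported success rates (86.8% versus 21.6%) can be compared in three ways that are easy to confuse: the absolute gain in percentage points, the ratio, and the relative gain. The helper below is a hypothetical illustration of that arithmetic only; it is not part of the AlignBot code.

```python
def compare_rates(new_pct, base_pct):
    """Return (percentage-point gain, ratio, relative gain in %) for two success rates."""
    point_gain = new_pct - base_pct                   # absolute difference
    ratio = new_pct / base_pct                        # how many times the baseline
    relative_gain = (new_pct - base_pct) / base_pct * 100.0
    return point_gain, ratio, relative_gain

point_gain, ratio, relative_gain = compare_rates(86.8, 21.6)
print(f"{point_gain:.1f} points, {ratio:.2f}x, {relative_gain:.0f}% relative")
```

With these inputs the gain is about 65.2 percentage points, the ratio is about 4.02, and the relative gain is about 302%.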
☆ Secure Control Systems for Autonomous Quadrotors against Cyber-Attacks
The problem of safety for robotic systems has been extensively studied.
However, little attention has been given to security issues for
three-dimensional systems, such as quadrotors. Malicious adversaries can
compromise robot sensors and communication networks, causing incidents,
achieving illegal objectives, or even injuring people. This study first designs
an intelligent control system for autonomous quadrotors. Then, it investigates
the problems of optimal false data injection attack scheduling and
countermeasure design for unmanned aerial vehicles. Using a state-of-the-art
deep learning-based approach, an optimal false data injection attack scheme is
proposed to deteriorate a quadrotor's tracking performance with limited attack
energy. Subsequently, an optimal tracking control strategy is learned to
mitigate attacks and recover the quadrotor's tracking performance. We base our
work on Agilicious, a state-of-the-art quadrotor recently deployed for
autonomous settings. This paper is the first in the United Kingdom to deploy
this quadrotor and implement reinforcement learning on its platform. Therefore,
to promote easy reproducibility with minimal engineering overhead, we further
provide (1) a comprehensive breakdown of this quadrotor, including software
stacks and hardware alternatives; (2) a detailed reinforcement-learning
framework to train autonomous controllers on Agilicious agents; and (3) a new
open-source environment that builds upon PyFlyt for future reinforcement
learning research on Agilicious platforms. Both simulated and real-world
experiments are conducted to show the effectiveness of the proposed frameworks
in section 5.2.
comment: The paper is based on an undergraduate thesis and is not intended for
publication in a journal
☆ SpotLight: Robotic Scene Understanding through Interaction and Affordance Detection
Despite increasing research efforts on household robotics, robots intended
for deployment in domestic settings still struggle with more complex tasks such
as interacting with functional elements like drawers or light switches, largely
due to limited task-specific understanding and interaction capabilities. These
tasks require not only detection and pose estimation but also an understanding
of the affordances these elements provide. To address these challenges and
enhance robotic scene understanding, we introduce SpotLight: A comprehensive
framework for robotic interaction with functional elements, specifically light
switches. Furthermore, this framework enables robots to improve their
environmental understanding through interaction. Leveraging VLM-based
affordance prediction to estimate motion primitives for light switch
interaction, we achieve up to 84% operation success in real-world experiments.
We further introduce a specialized dataset containing 715 images as well as a
custom detection model for light switch detection. We demonstrate how the
framework can facilitate robot learning through physical interaction by having
the robot explore the environment and discover previously unknown relationships
in a scene graph representation. Lastly, we propose an extension to the
framework to accommodate other functional interactions such as swing doors,
showcasing its flexibility. Videos and Code:
timengelbracht.github.io/SpotLight/
comment: timengelbracht.github.io/SpotLight/
☆ Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation
Large Language Models (LLMs) have gained popularity in task planning for
long-horizon manipulation tasks. To enhance the validity of LLM-generated
plans, visual demonstrations and online videos have been widely employed to
guide the planning process. However, for manipulation tasks involving subtle
movements but rich contact interactions, visual perception alone may be
insufficient for the LLM to fully interpret the demonstration. Additionally,
visual data provides limited information on force-related parameters and
conditions, which are crucial for effective execution on real robots.
In this paper, we introduce an in-context learning framework that
incorporates tactile and force-torque information from human demonstrations to
enhance LLMs' ability to generate plans for new task scenarios. We propose a
bootstrapped reasoning pipeline that sequentially integrates each modality into
a comprehensive task plan. This task plan is then used as a reference for
planning in new task configurations. Real-world experiments on two different
sequential manipulation tasks demonstrate the effectiveness of our framework in
improving LLMs' understanding of multi-modal demonstrations and enhancing the
overall planning performance.
☆ Physically-Based Photometric Bundle Adjustment in Non-Lambertian Environments IROS 2024
Lei Cheng, Junpeng Hu, Haodong Yan, Mariia Gladkova, Tianyu Huang, Yun-Hui Liu, Daniel Cremers, Haoang Li
Photometric bundle adjustment (PBA) is widely used in estimating the camera
pose and 3D geometry by assuming a Lambertian world. However, the assumption of
photometric consistency is often violated since the non-diffuse reflection is
common in real-world environments. The photometric inconsistency significantly
affects the reliability of existing PBA methods. To solve this problem, we
propose a novel physically-based PBA method. Specifically, we introduce the
physically-based weights regarding material, illumination, and light path.
These weights distinguish the pixel pairs with different levels of photometric
inconsistency. We also design corresponding models for material estimation
based on sequential images and illumination estimation based on point clouds.
In addition, we establish the first SLAM-related dataset of non-Lambertian
scenes with complete ground truth of illumination and material. Extensive
experiments demonstrated that our PBA method outperforms existing approaches in
accuracy.
comment: Accepted to 2024 IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2024)
☆ XP-MARL: Auxiliary Prioritization in Multi-Agent Reinforcement Learning to Address Non-Stationarity
Non-stationarity poses a fundamental challenge in Multi-Agent Reinforcement
Learning (MARL), arising from agents simultaneously learning and altering their
policies. This creates a non-stationary environment from the perspective of
each individual agent, often leading to suboptimal or even unconverged learning
outcomes. We propose an open-source framework named XP-MARL, which augments
MARL with auxiliary prioritization to address this challenge in cooperative
settings. XP-MARL is 1) founded upon our hypothesis that prioritizing agents
and letting higher-priority agents establish their actions first would
stabilize the learning process and thus mitigate non-stationarity and 2)
enabled by our proposed mechanism called action propagation, where
higher-priority agents act first and communicate their actions, providing a
more stationary environment for others. Moreover, instead of using a predefined
or heuristic priority assignment, XP-MARL learns priority-assignment policies
with an auxiliary MARL problem, leading to a joint learning scheme. Experiments
in a motion-planning scenario involving Connected and Automated Vehicles (CAVs)
demonstrate that XP-MARL improves the safety of a baseline model by 84.4% and
outperforms a state-of-the-art approach, which improves the baseline by only
12.8%. Code: github.com/cas-lab-munich/sigmarl
comment: 7 pages, 5 figures. This work has been submitted to the IEEE for
possible publication. Copyright may be transferred without notice, after
which this version may no longer be accessible
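The action-propagation mechanism described in the XP-MARL abstract can be sketched as follows, under strong simplifying assumptions: priorities are given as plain scores rather than learned, and the policy is a hypothetical toy rule instead of a trained network. The names `act_with_propagation` and `toy_lane_policy` are illustrative and do not come from the XP-MARL code base.

```python
def act_with_propagation(observations, priorities, policy):
    """Let agents act in descending priority; each sees higher-priority actions."""
    order = sorted(observations, key=lambda agent: priorities[agent], reverse=True)
    chosen = {}
    for agent in order:
        # Pass a copy so the policy cannot mutate the shared record of actions.
        chosen[agent] = policy(observations[agent], dict(chosen))
    return order, chosen

def toy_lane_policy(preferred_lane, higher_priority_actions):
    """Hypothetical policy: take the preferred lane, or the next free one above it."""
    taken = set(higher_priority_actions.values())
    lane = preferred_lane
    while lane in taken:
        lane += 1
    return lane

observations = {"a": 0, "b": 0, "c": 1}      # each agent's preferred lane
priorities = {"a": 0.2, "b": 0.9, "c": 0.5}  # assumed scores; learned in XP-MARL
order, actions = act_with_propagation(observations, priorities, toy_lane_policy)
```

Here agent `b` commits first, and lower-priority agents condition on the actions already announced, which is the sense in which the environment looks more stationary to them.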
☆ RaggeDi: Diffusion-based State Estimation of Disordered Rags, Sheets, Towels and Blankets
Cloth state estimation is an important problem in robotics. It is essential
for the robot to know the accurate state to manipulate cloth and execute tasks
such as robotic dressing, stitching, and covering/uncovering human beings.
However, estimating cloth state accurately remains challenging due to its high
flexibility and self-occlusion. This paper proposes a diffusion model-based
pipeline that formulates the cloth state estimation as an image generation
problem by representing the cloth state as an RGB image that describes the
point-wise translation (translation map) between a pre-defined flattened mesh
and the deformed mesh in a canonical space. Then we train a conditional
diffusion-based image generation model to predict the translation map based on
an observation. Experiments are conducted in both simulation and the real world
to validate the performance of our method. Results indicate that our method
outperforms two recent methods in both accuracy and speed.
☆ RoboMorph: In-Context Meta-Learning for Robot Dynamics Modeling
Manuel Bianchi Bazzi, Asad Ali Shahid, Christopher Agia, John Alora, Marco Forgione, Dario Piga, Francesco Braghin, Marco Pavone, Loris Roveda
The landscape of Deep Learning has experienced a major shift with the
pervasive adoption of Transformer-based architectures, particularly in Natural
Language Processing (NLP). Novel avenues for physical applications, such as
solving Partial Differential Equations and Image Vision, have been explored.
However, in challenging domains like robotics, where high non-linearity poses
significant challenges, Transformer-based applications are scarce. While
Transformers have been used to provide robots with knowledge about high-level
tasks, few efforts have been made to perform system identification. This paper
proposes a novel methodology to learn a meta-dynamical model of a
high-dimensional physical system, such as the Franka robotic arm, using a
Transformer-based architecture without prior knowledge of the system's physical
parameters. The objective is to predict quantities of interest (end-effector
pose and joint positions) given the torque signals for each joint. This
prediction can be useful as a component for Deep Model Predictive Control
frameworks in robotics. The meta-model establishes the correlation between
torques and positions and predicts the output for the complete trajectory. This
work provides empirical evidence of the efficacy of the in-context learning
paradigm, suggesting future improvements in learning the dynamics of robotic
systems without explicit knowledge of physical parameters. Code, videos, and
supplementary materials can be found at the project website:
https://sites.google.com/view/robomorph/
☆ Hook-Based Aerial Payload Grasping from a Moving Platform
This paper investigates payload grasping from a moving platform using a
hook-equipped aerial manipulator. First, a computationally efficient trajectory
optimization based on complementarity constraints is proposed to determine the
optimal grasping time. To enable application in complex, dynamically changing
environments, the future motion of the payload is predicted using physics
simulator-based models. The success of payload grasping under model
uncertainties and external disturbances is formally verified through a
robustness analysis method based on integral quadratic constraints. The
proposed algorithms are evaluated in a high-fidelity physical simulator, and in
real flight experiments using a custom-designed aerial manipulator platform.
☆ One Map to Find Them All: Real-time Open-Vocabulary Mapping for Zero-shot Multi-Object Navigation
The capability to efficiently search for objects in complex environments is
fundamental for many real-world robot applications. Recent advances in
open-vocabulary vision models have resulted in semantically-informed object
navigation methods that allow a robot to search for an arbitrary object without
prior training. However, these zero-shot methods have so far treated the
environment as unknown for each consecutive query. In this paper we introduce a
new benchmark for zero-shot multi-object navigation, allowing the robot to
leverage information gathered from previous searches to more efficiently find
new objects. To address this problem we build a reusable open-vocabulary
feature map tailored for real-time object search. We further propose a
probabilistic-semantic map update that mitigates common sources of errors in
semantic feature extraction and leverage this semantic uncertainty for informed
multi-object exploration. We evaluate our method on a set of object navigation
tasks both in simulation and on a real robot, running in real time on
a Jetson Orin AGX. We demonstrate that it outperforms existing state-of-the-art
approaches both on single and multi-object navigation tasks. Additional videos,
code and the multi-object navigation benchmark will be available on
https://finnbsch.github.io/OneMap.
☆ RockTrack: A 3D Robust Multi-Camera-Ken Multi-Object Tracking Framework
3D Multi-Object Tracking (MOT) obtains significant performance improvements
with the rapid advancements in 3D object detection, particularly in
cost-effective multi-camera setups. However, the prevalent end-to-end training
approach for multi-camera trackers results in detector-specific models,
limiting their versatility. Moreover, current generic trackers overlook the
unique features of multi-camera detectors, i.e., the unreliability of motion
observations and the feasibility of visual information. To address these
challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors.
Following the Tracking-By-Detection framework, RockTrack is compatible with
various off-the-shelf detectors. RockTrack incorporates a confidence-guided
preprocessing module to extract reliable motion and image observations from
distinct representation spaces from a single detector. These observations are
then fused in an association module that leverages geometric and appearance
cues to minimize mismatches. The resulting matches are propagated through a
staged estimation process, forming the basis for heuristic noise modeling.
Additionally, we introduce a novel appearance similarity metric for explicitly
characterizing object affinities in multi-camera settings. RockTrack achieves
state-of-the-art performance on the nuScenes vision-only tracking leaderboard
with 59.1% AMOTA while demonstrating impressive computational efficiency.
comment: RockTrack establishes a new state-of-the-art with 59.1% AMOTA on the
nuScenes vision-only test leaderboard with ResNet50-level backbone
☆ Multi-robot connection towards collective obstacle field traversal
Environments with large terrain height variations present great challenges
for legged robot locomotion. Drawing inspiration from fire ants' collective
assembly behavior, we study strategies that can enable two "connectable"
robots to collectively navigate over bumpy terrains with height variations
larger than robot leg length. Each robot was designed to be extremely simple,
with a cubical body and one rotary motor actuating four vertical peg legs that
move in pairs. Two or more robots could physically connect to one another to
enhance collective mobility. We performed locomotion experiments with a
two-robot group, across an obstacle field filled with uniformly-distributed
semi-spherical "boulders". Experimentally measured robot speed suggested that
the connection length between the robots has a significant effect on collective
mobility: connection lengths C in [0.86, 0.9] robot unit body lengths (UBL)
produced sustained movement across the obstacle field, whereas
connection lengths C in [0.63, 0.84] and [0.92, 1.1] UBL resulted in low
traversability. An energy landscape based model revealed the underlying
mechanism of how connection length modulated collective mobility through the
system's potential energy landscape, and informed strategies for the
two-robot system to adapt its connection length for traversing obstacle
fields with varying spatial frequencies. Our results demonstrated that by
varying the connection configuration between the robots, the two-robot system
could leverage mechanical intelligence to better utilize obstacle interaction
forces and produce improved locomotion. Going forward, we envision that
generalized principles of robot-environment coupling can inform design and
control strategies for a large group of small robots to achieve ant-like
collective environment negotiation.
☆ Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects
Human cognition can leverage fundamental conceptual knowledge, like geometric
and kinematic ones, to appropriately perceive, comprehend and interact with
novel objects. Motivated by this finding, we aim to endow machine intelligence
with an analogous capability by operating at the conceptual level, in
order to understand and then interact with articulated objects, especially for
those in novel categories, which is challenging due to the intricate geometric
structures and diverse joint types of articulated objects. To achieve this
goal, we propose Analytic Ontology Template (AOT), a parameterized and
differentiable program description of generalized conceptual ontologies. A
baseline approach called AOTNet driven by AOTs is designed accordingly to equip
intelligent agents with these generalized concepts, and then empower the agents
to effectively discover the conceptual knowledge on the structure and
affordance of articulated objects. The AOT-driven approach yields benefits in
three key perspectives: i) enabling concept-level understanding of articulated
objects without relying on any real training data, ii) providing analytic
structure information, and iii) introducing rich affordance information
indicating proper ways of interaction. We conduct exhaustive experiments and
the results demonstrate the superiority of our approach in understanding and
then interacting with articulated objects.
☆ RMP-YOLO: A Robust Motion Predictor for Partially Observable Scenarios even if You Only Look Once
Jiawei Sun, Jiahui Li, Tingchen Liu, Chengran Yuan, Shuo Sun, Zefan Huang, Anthony Wong, Keng Peng Tee, Marcelo H. Ang Jr
We introduce RMP-YOLO, a unified framework designed to provide robust motion
predictions even with incomplete input data. Our key insight stems from the
observation that complete and reliable historical trajectory data plays a
pivotal role in ensuring accurate motion prediction. Therefore, we propose a
new paradigm that prioritizes the reconstruction of intact historical
trajectories before feeding them into the prediction modules. Our approach
introduces a novel scene tokenization module to enhance the extraction and
fusion of spatial and temporal features. Following this, our proposed recovery
module reconstructs agents' incomplete historical trajectories by leveraging
local map topology and interactions with nearby agents. The reconstructed,
clean historical data is then integrated into the downstream prediction
modules. Our framework is able to effectively handle missing data of varying
lengths and remains robust against observation noise, while maintaining high
prediction accuracy. Furthermore, our recovery module is compatible with
existing prediction models, ensuring seamless integration. Extensive
experiments validate the effectiveness of our approach, and deployment in
real-world autonomous vehicles confirms its practical utility. In the 2024
Waymo Motion Prediction Competition, our method, RMP-YOLO, achieves
state-of-the-art performance, securing third place.
☆ From Words to Wheels: Automated Style-Customized Policy Generation for Autonomous Driving
Autonomous driving technology has witnessed rapid advancements, with
foundation models improving interactivity and user experiences. However,
current autonomous vehicles (AVs) face significant limitations in delivering
command-based driving styles. Most existing methods either rely on predefined
driving styles that require expert input or use data-driven techniques like
Inverse Reinforcement Learning to extract styles from driving data. These
approaches, though effective in some cases, face challenges: difficulty
obtaining specific driving data for style matching (e.g., in Robotaxis),
inability to align driving style metrics with user preferences, and limitations
to pre-existing styles, restricting customization and generalization to new
commands. This paper introduces Words2Wheels, a framework that automatically
generates customized driving policies based on natural language user commands.
Words2Wheels employs a Style-Customized Reward Function to generate a
Style-Customized Driving Policy without relying on prior driving data. By
leveraging large language models and a Driving Style Database, the framework
efficiently retrieves, adapts, and generalizes driving styles. A Statistical
Evaluation module ensures alignment with user preferences. Experimental results
demonstrate that Words2Wheels outperforms existing methods in accuracy,
generalization, and adaptability, offering a novel solution for customized AV
driving behavior. Code and demo available at
https://yokhon.github.io/Words2Wheels/.
comment: 6 pages, 7 figures
☆ SLAM assisted 3D tracking system for laparoscopic surgery
A major limitation of minimally invasive surgery is the difficulty in
accurately locating the internal anatomical structures of the target organ due
to the lack of tactile feedback and transparency. Augmented reality (AR) offers
a promising solution to overcome this challenge. Numerous studies have shown
that combining learning-based and geometric methods can achieve accurate
preoperative and intraoperative data registration. This work proposes a
real-time monocular 3D tracking algorithm for post-registration tasks. The
ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The
primitive 3D shape is used for fast initialization of the monocular SLAM. A
pseudo-segmentation strategy is employed to separate the target organ from the
background for tracking purposes, and the geometric prior of the 3D shape is
incorporated as an additional constraint in the pose graph. Experiments from
in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system
provides robust 3D tracking and effectively handles typical challenges such as
fast motion, out-of-field-of-view scenarios, partial visibility, and
"organ-background" relative motion.
comment: Demo: https://youtu.be/B1xZW8bj3cM
☆ Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning
The intricate nature of real-world driving environments, characterized by
dynamic and diverse interactions among multiple vehicles and their possible
future states, presents considerable challenges in accurately predicting the
motion states of vehicles and handling the uncertainty inherent in the
predictions. Addressing these challenges requires comprehensive modeling and
reasoning to capture the implicit relations among vehicles and the
corresponding diverse behaviors. This research introduces an integrated
framework for autonomous vehicles (AVs) motion prediction to address these
complexities, utilizing a novel Relational Hypergraph Interaction-informed
Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational
reasoning by integrating a multi-scale hypergraph neural network to model
group-wise interactions among multiple vehicles and their multi-modal driving
behaviors, thereby enhancing motion prediction accuracy and reliability.
Experimental validation using real-world datasets demonstrates the superior
performance of this framework in improving predictive accuracy and fostering
socially aware automated driving in dynamic traffic scenarios.
☆ Learning-accelerated A* Search for Risk-aware Path Planning SC
Safety is a critical concern for urban flights of autonomous Unmanned Aerial
Vehicles. In populated environments, risk should be accounted for to produce an
effective and safe path, known as risk-aware path planning. Risk-aware path
planning can be modeled as a Constrained Shortest Path (CSP) problem, aiming to
identify the shortest possible route that adheres to specified safety
thresholds. CSP is NP-hard and poses significant computational challenges.
Although many traditional methods can solve it accurately, all of them are very
slow. Our method introduces an additional safety dimension to the traditional
A* (called ASD A*), enabling A* to handle CSP. Furthermore, we develop a custom
learning-based heuristic using transformer-based neural networks, which
significantly reduces the computational load and improves the performance of
the ASD A* algorithm. The proposed method is well-validated with both random
and realistic simulation scenarios.
comment: AIAA SCITECH 2024 Forum
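As a rough illustration of adding a safety dimension to A*, the sketch below searches over (node, accumulated risk) states and discards any extension that exceeds a risk budget. It is a minimal sketch under assumptions, not the authors' ASD A*: the graph format, the function name `constrained_a_star`, the dominance pruning, and the default zero heuristic (standing in for the learned transformer heuristic) are all hypothetical.

```python
import heapq
import itertools

def constrained_a_star(graph, start, goal, risk_budget, heuristic=lambda n: 0.0):
    """Shortest path whose accumulated risk stays within risk_budget.

    graph: dict mapping node -> list of (neighbor, length, risk) edges.
    heuristic: admissible estimate of remaining length (assumed; zero by default).
    Returns (length, risk, path) or None if no feasible path exists.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares paths
    frontier = [(heuristic(start), next(counter), 0.0, 0.0, start, [start])]
    labels = {}  # node -> list of expanded (length, risk) labels
    while frontier:
        _, _, length, risk, node, path = heapq.heappop(frontier)
        if node == goal:
            return length, risk, path
        # Skip labels dominated in both length and risk by an expanded one.
        if any(l <= length and r <= risk for l, r in labels.get(node, [])):
            continue
        labels.setdefault(node, []).append((length, risk))
        for nxt, edge_len, edge_risk in graph.get(node, []):
            new_risk = risk + edge_risk
            if new_risk > risk_budget:
                continue  # violates the safety threshold
            new_len = length + edge_len
            heapq.heappush(
                frontier,
                (new_len + heuristic(nxt), next(counter), new_len, new_risk, nxt, path + [nxt]),
            )
    return None

# Hypothetical toy graph: a short risky route and a longer safe route.
toy_graph = {
    "A": [("B", 1.0, 5.0), ("C", 2.0, 1.0)],
    "B": [("D", 1.0, 5.0)],
    "C": [("D", 2.0, 1.0)],
}
```

With a loose budget the search returns the short route through B; tightening the budget forces the longer route through C, and a budget that no route satisfies yields `None`.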
♻ ☆ AnySkin: Plug-and-play Skin Sensing for Robotic Touch
While tactile sensing is widely accepted as an important and useful sensing
modality, its use pales in comparison to other sensory modalities like vision
and proprioception. AnySkin addresses the critical challenges that impede the
use of tactile sensing -- versatility, replaceability, and data reusability.
Building on the simplistic design of ReSkin, and decoupling the sensing
electronics from the sensing interface, AnySkin simplifies integration making
it as straightforward as putting on a phone case and connecting a charger.
Furthermore, AnySkin is the first uncalibrated tactile-sensor with
cross-instance generalizability of learned manipulation policies. To summarize,
this work makes three key contributions: first, we introduce a streamlined
fabrication process and a design tool for creating an adhesive-free, durable
and easily replaceable magnetic tactile sensor; second, we characterize slip
detection and policy learning with the AnySkin sensor; and third, we
demonstrate zero-shot generalization of models trained on one instance of
AnySkin to new instances, and compare it with popular existing tactile
solutions like DIGIT and ReSkin. https://any-skin.github.io/
♻ ☆ TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes ICRA2025
In this paper, we present a new approach to bridge the domain gap between
synthetic and real-world data for unmanned aerial vehicle (UAV)-based
perception. Our formulation is designed for dynamic scenes, consisting of small
moving objects or human actions. We propose an extension of K-Planes Neural
Radiance Field (NeRF), wherein our algorithm stores a set of tiered feature
vectors. The tiered feature vectors are generated to effectively model
conceptual information about a scene as well as an image decoder that
transforms output feature maps into RGB images. Our technique leverages the
information amongst both static and dynamic objects within a scene and is able
to capture salient scene attributes of high altitude videos. We evaluate its
performance on challenging datasets, including Okutama Action and UG2, and
observe considerable improvement in accuracy over state of the art neural
rendering methods.
comment: 8 pages, submitted to ICRA2025
♻ ☆ Autonomous Navigation in Ice-Covered Waters with Learned Predictions on Ship-Ice Interactions
Autonomous navigation in ice-covered waters poses significant challenges due
to the frequent lack of viable collision-free trajectories. When complete
obstacle avoidance is infeasible, it becomes imperative for the navigation
strategy to minimize collisions. Additionally, the dynamic nature of ice, which
moves in response to ship maneuvers, complicates the path planning process. To
address these challenges, we propose a novel deep learning model to estimate
the coarse dynamics of ice movements triggered by ship actions through
occupancy estimation. To ensure real-time applicability, we propose a novel
approach that caches intermediate prediction results and seamlessly integrates
the predictive model into a graph search planner. We evaluate the proposed
planner both in simulation and in a physical testbed against existing
approaches and show that our planner significantly reduces collisions with ice
when compared to the state-of-the-art. Codes and demos of this work are
available at https://github.com/IvanIZ/predictive-asv-planner.
♻ ☆ Uncovering the Secrets of Human-Like Movement: A Fresh Perspective on Motion Planning
This article explores human-like movement from a fresh perspective on motion
planning. We analyze the coordinated and compliant movement mechanisms of the
human body from the perspective of biomechanics. Based on these mechanisms, we
propose an optimal control framework that integrates compliant control
dynamics, optimizing robotic arm motion through a response time matrix. This
matrix sets the timing parameters for joint movements, turning the system into
a time-parameterized optimal control problem. The model focuses on the
interaction between active and passive joints under external disturbances,
improving adaptability and compliance. This method achieves optimal trajectory
generation and balances precision and compliance. Experimental results on both
a manipulator and a humanoid robot validate the approach.
comment: 7 pages
♻ ☆ Checklist to Define the Identification of TP, FP, and FN Object Detections in Automated Driving
The object perception of automated driving systems must pass quality and
robustness tests before a safe deployment. Such tests typically identify true
positive (TP), false-positive (FP), and false-negative (FN) detections and
aggregate them to metrics. Since the literature seems to be lacking a
comprehensive way to define the identification of TPs/FPs/FNs, this paper
provides a checklist of relevant functional aspects and implementation details.
Besides labeling policies of the test set, we cover areas of vision, occlusion
handling, safety-relevant areas, matching criteria, temporal and probabilistic
issues, and further aspects. Even though the checklist cannot be fully
formalized, it can help practitioners minimize the ambiguity of their tests,
which, in turn, makes statements on object perception more reliable and
comparable.
comment: This version improves the checklist's usability by providing bullet
points to follow. It also condenses the contributions to safety assurance
down to the "Related Work" section. 11 pages, 3 figures
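To make the TP/FP/FN identification step concrete, here is a minimal sketch of one common matching criterion: greedy, score-ordered matching of 2D axis-aligned boxes by intersection-over-union. The box format, the 0.5 threshold, and the function names are assumptions for illustration; the checklist itself covers many further choices (occlusion, areas of interest, temporal aspects) that this sketch ignores.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_detections(detections, ground_truth, iou_threshold=0.5):
    """Return (TP, FP, FN) counts.

    detections: list of (box, score); ground_truth: list of boxes.
    Each ground-truth box can be matched by at most one detection.
    """
    unmatched = set(range(len(ground_truth)))
    tp = fp = 0
    # Process detections from highest to lowest confidence.
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best, best_iou = None, iou_threshold
        for i in sorted(unmatched):
            overlap = iou(box, ground_truth[i])
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is None:
            fp += 1  # no unmatched ground truth overlaps enough
        else:
            unmatched.remove(best)
            tp += 1
    return tp, fp, len(unmatched)  # leftover ground truth counts as FN

# Hypothetical example: one duplicate detection and one spurious detection.
gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
```

Even in this small case the result depends on policy choices: the duplicate detection counts as an FP only because each ground-truth box may be matched once.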
♻ ☆ ViewActive: Active viewpoint optimization from a single image
When observing objects, humans benefit from their spatial visualization and
mental rotation ability to envision potential optimal viewpoints based on the
current observation. This capability is crucial for enabling robots to achieve
efficient and robust scene perception during operation, as optimal viewpoints
provide essential and informative features for accurately representing scenes
in 2D images, thereby enhancing downstream tasks.
To endow robots with this human-like active viewpoint optimization
capability, we propose ViewActive, a modernized machine learning approach
drawing inspiration from aspect graph, which provides viewpoint optimization
guidance based solely on the current 2D image input. Specifically, we introduce
the 3D Viewpoint Quality Field (VQF), a compact and consistent representation
for viewpoint quality distribution similar to an aspect graph, composed of
three general-purpose viewpoint quality metrics: self-occlusion ratio,
occupancy-aware surface normal entropy, and visual entropy. We utilize
pre-trained image encoders to extract robust visual and semantic features,
which are then decoded into the 3D VQF, allowing our model to generalize
effectively across diverse objects, including unseen categories. The lightweight
ViewActive network (72 FPS on a single GPU) significantly enhances the
performance of state-of-the-art object recognition pipelines and can be
integrated into real-time motion planning for robotic applications. Our code
and dataset are available here: https://github.com/jiayi-wu-umd/ViewActive
♻ ☆ Dynamic Gap: Safe Gap-based Navigation in Dynamic Environments
This paper extends the family of gap-based local planners to unknown dynamic
environments through generating provable collision-free properties for
hierarchical navigation systems. Existing perception-informed local planners
that operate in dynamic environments rely on emergent or empirical robustness
for collision avoidance as opposed to performing formal analysis of dynamic
obstacles. In addition to this, the obstacle tracking that is performed in
these existent planners is often achieved with respect to a global inertial
frame, subjecting such tracking estimates to transformation errors from
odometry drift. The proposed local planner, dynamic gap, shifts the tracking
paradigm to modeling how the free space, represented as gaps, evolves over
time. Gap crossing and closing conditions are developed to aid in determining
the feasibility of passage through gaps, and a breadth of simulation
benchmarking is performed against other navigation planners in the literature
where the proposed dynamic gap planner achieves the highest success rate out of
all planners tested in all environments.
comment: Under review
♻ ☆ Annealed Winner-Takes-All for Motion Forecasting
In autonomous driving, motion prediction aims at forecasting the future
trajectories of nearby agents, helping the ego vehicle to anticipate behaviors
and drive safely. A key challenge is generating a diverse set of future
predictions, commonly addressed using data-driven models with Multiple Choice
Learning (MCL) architectures and Winner-Takes-All (WTA) training objectives.
However, these methods face initialization sensitivity and training
instabilities. Additionally, to compensate for limited performance, some
approaches rely on training with a large set of hypotheses, requiring a
post-selection step during inference to significantly reduce the number of
predictions. To tackle these issues, we take inspiration from annealed MCL, a
recently introduced technique that improves the convergence properties of MCL
methods through an annealed Winner-Takes-All loss (aWTA). In this paper, we
demonstrate how the aWTA loss can be integrated with state-of-the-art motion
forecasting models to enhance their performance using only a minimal set of
hypotheses, eliminating the need for the cumbersome post-selection step. Our
approach can be easily incorporated into any trajectory prediction model
normally trained using WTA and yields significant improvements. To facilitate
the application of our approach to future motion forecasting models, the code
will be made publicly available upon acceptance:
https://github.com/valeoai/MF_aWTA.
comment: 7 pages, 6 figures
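The relation between WTA and an annealed variant can be sketched as follows. This is a minimal illustration that assumes the annealed loss weights each hypothesis by a softmin (Boltzmann) distribution over per-hypothesis losses at a given temperature; the abstract does not spell out the exact form, and the function names and temperatures here are hypothetical.

```python
import math

def awta_weights(losses, temperature):
    """Softmin weights over per-hypothesis losses (assumed annealed assignment)."""
    m = min(losses)  # subtract the minimum for numerical stability
    exps = [math.exp(-(l - m) / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

def awta_loss(losses, temperature):
    """Weighted sum of hypothesis losses.

    In a training framework the weights would typically be treated as
    constants (no gradient flows through them); that is an assumption here.
    """
    weights = awta_weights(losses, temperature)
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-hypothesis losses for one sample.
hypothesis_losses = [1.0, 2.0, 4.0]
```

At high temperature every hypothesis receives a nearly equal share of the loss, and as the temperature is annealed toward zero the loss approaches the plain WTA objective, where only the best hypothesis is updated.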
♻ ☆ BEATLE -- Self-Reconfigurable Aerial Robot: Design, Control and Experimental Validation
Modular self-reconfigurable robots (MSRRs) offer enhanced task flexibility by
constructing various structures suitable for each task. However, conventional
terrestrial MSRRs equipped with wheels face critical challenges, including
limitations in the size of constructible structures and system robustness due
to elevated wrench loads applied to each module. In this work, we introduce an
Aerial MSRR (A-MSRR) system named BEATLE, capable of merging and separating
in-flight. BEATLE can merge without applying wrench loads to adjacent modules,
thereby expanding the scalability and robustness of conventional terrestrial
MSRRs. In this article, we propose a system configuration for BEATLE, including
mechanical design, a control framework for multi-connected flight, and a motion
planner for reconfiguration motion. The design of a docking mechanism and
housing structure aims to balance the durability of the constructed structure
with ease of separation. Furthermore, the proposed flight control framework
achieves stable multi-connected flight based on contact wrench control.
Moreover, the proposed motion planner based on a finite state machine (FSM)
achieves precise and robust reconfiguration motion. We also introduce the
actual implementation of the prototype and validate the robustness and
scalability of the proposed system design through experiments and simulation
studies.
♻ ☆ RiskMap: A Unified Driving Context Representation for Autonomous Motion Planning in Urban Driving Environment
Motion planning is a complicated task that requires the combination of
perception, map information integration and prediction, particularly when
driving in heavy traffic. Developing an extensible and efficient representation
that visualizes sensor noise and provides a basis for real-time planning tasks is
desirable. We aim to develop an interpretable map representation that offers a
prior on driving cost in planning tasks. In this way, we can simplify the
planning process for dealing with complex driving scenarios and visualize
sensor noise. Specifically, we propose a unified context representation
empowered by deep neural networks. The unified representation is a
differentiable risk field, which is an analytical representation of statistical
cognition regarding traffic participants for downstream planning tasks. This
representation method is named RiskMap. A sampling-based planner is
adopted to train and compare RiskMap generation methods. In this paper, we
explore the RiskMap generation tools and model structures. The results
illustrate that our method can improve driving safety and smoothness, and the
limitations of our method are also discussed.
comment: Under review
♻ ☆ A Generic Trajectory Planning Method for Constrained All-Wheel-Steering Robots
This paper presents a generic trajectory planning method for wheeled robots
with fixed steering axes while the steering angle of each wheel is constrained.
In the existing literature, All-Wheel-Steering (AWS) robots, incorporating
modes such as rotation-free translation maneuvers, in-situ rotational
maneuvers, and proportional steering, exhibit inefficient performance due to
time-consuming mode switches. This inefficiency arises from wheel rotation
constraints and inter-wheel cooperation requirements. The direct application of
a holonomic moving strategy can lead to significant slip angles or even
structural failure. Additionally, the limited steering range of AWS wheeled
robots exacerbates non-linearity characteristics, thereby complicating control
processes. To address these challenges, we developed a novel planning method
termed Constrained AWS (C-AWS), which integrates second-order discrete search
with predictive control techniques. Experimental results demonstrate that our
method adeptly generates feasible and smooth trajectories for C-AWS while
adhering to steering angle constraints.
comment: Accepted by IROS 2024
♻ ☆ 3DGS-Calib: 3D Gaussian Splatting for Multimodal SpatioTemporal Calibration IROS 2024
Quentin Herau, Moussab Bennehar, Arthur Moreau, Nathan Piasco, Luis Roldao, Dzmitry Tsishkou, Cyrille Migniot, Pascal Vasseur, Cédric Demonceaux
Reliable multimodal sensor fusion algorithms require accurate spatiotemporal
calibration. Recently, targetless calibration techniques based on implicit
neural representations have proven to provide precise and robust results.
Nevertheless, such methods are inherently slow to train given the high
computational overhead caused by the large number of sampled points required
for volume rendering. With the recent introduction of 3D Gaussian Splatting as
a faster alternative to implicit representation methods, we propose to leverage
this new rendering approach to achieve faster multi-sensor calibration. We
introduce 3DGS-Calib, a new calibration method that relies on the speed and
rendering accuracy of 3D Gaussian Splatting to achieve multimodal
spatiotemporal calibration that is accurate, robust, and with a substantial
speed-up compared to methods relying on implicit neural representations. We
demonstrate the superiority of our proposal with experimental results on
sequences from KITTI-360, a widely used driving dataset.
comment: Accepted at IROS 2024 (Oral presentation). Project page:
https://qherau.github.io/3DGS-Calib/
♻ ☆ AirSLAM: An Efficient and Illumination-Robust Point-Line Visual SLAM System
In this paper, we present an efficient visual SLAM system designed to tackle
both short-term and long-term illumination challenges. Our system adopts a
hybrid approach that combines deep learning techniques for feature detection
and matching with traditional backend optimization methods. Specifically, we
propose a unified convolutional neural network (CNN) that simultaneously
extracts keypoints and structural lines. These features are then associated,
matched, triangulated, and optimized in a coupled manner. Additionally, we
introduce a lightweight relocalization pipeline that reuses the built map,
where keypoints, lines, and a structure graph are used to match the query frame
with the map. To enhance the applicability of the proposed system to real-world
robots, we deploy and accelerate the feature detection and matching networks
using C++ and NVIDIA TensorRT. Extensive experiments conducted on various
datasets demonstrate that our system outperforms other state-of-the-art visual
SLAM systems in illumination-challenging environments. Efficiency evaluations
show that our system can run at a rate of 73Hz on a PC and 40Hz on an embedded
platform.
comment: 19 pages, 14 figures
♻ ☆ Bayesian Optimal Experimental Design for Robot Kinematic Calibration
This paper develops a Bayesian optimal experimental design for robot
kinematic calibration on ${\mathbb{S}^3 \!\times\! \mathbb{R}^3}$. Our method
builds upon a Gaussian process approach that incorporates a geometry-aware
kernel based on Riemannian Matérn kernels over ${\mathbb{S}^3}$. To learn the
forward kinematics errors via Bayesian optimization with a Gaussian process, we
define a geodesic distance-based objective function. Pointwise values of this
function are sampled via noisy measurements taken through fiducial markers on
the end-effector using a camera and computed pose with the nominal kinematics.
The corrected Denavit-Hartenberg parameters are obtained using an efficient
quadratic program that operates on the collected data sets. The effectiveness
of the proposed method is demonstrated via simulations and calibration
experiments on NASA's ocean world lander autonomy testbed (OWLAT).
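A geodesic distance-based pose error on $\mathbb{S}^3 \times \mathbb{R}^3$ could look like the sketch below. The abstract does not give the objective, so everything here is an assumption: orientation error is taken as the rotation angle between unit quaternions (identifying q with -q), position error as the Euclidean norm, and the two are combined with a hypothetical weight `alpha`.

```python
import math

def quat_geodesic(q1, q2):
    """Rotation angle between two unit quaternions (w, x, y, z), in radians."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))  # abs() treats q and -q as equal
    return 2.0 * math.acos(min(1.0, dot))  # clamp against rounding above 1

def pose_error(q_meas, p_meas, q_nom, p_nom, alpha=1.0):
    """Assumed objective: angular error plus alpha times translational error."""
    return quat_geodesic(q_meas, q_nom) + alpha * math.dist(p_meas, p_nom)

# Hypothetical poses: identity versus a 90-degree rotation about the z axis.
q_identity = (1.0, 0.0, 0.0, 0.0)
q_z90 = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))
```

Pointwise values of such an error, evaluated between a measured pose and the pose computed from the nominal kinematics, are the kind of noisy samples a Gaussian process could be fit to.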
♻ ☆ VascularPilot3D: Toward a 3D fully autonomous navigation for endovascular robotics
Jingwei Song, Keke Yang, Han Chen, Jiayi Liu, Yinan Gu, Qianxin Hui, Yanqi Huang, Meng Li, Zheng Zhang, Tuoyu Cao, Maani Ghaffari
This research reports VascularPilot3D, the first 3D fully autonomous
endovascular robot navigation system. As an exploration toward autonomous
guidewire navigation, VascularPilot3D is developed as a complete navigation
system based on intra-operative imaging systems (fluoroscopic X-ray in this
study) and typical endovascular robots. VascularPilot3D adopts previously
researched fast 3D-2D vessel registration algorithms and guidewire segmentation
methods as its perception modules. We additionally propose three modules: a
topology-constrained 2D-3D instrument end-point lifting method, a tree-based
fast path planning algorithm, and a prior-free endovascular navigation
strategy. VascularPilot3D is compatible with most mainstream endovascular
robots. Ex-vivo experiments validate that VascularPilot3D achieves a 100% success
rate across 25 trials. It reduces the human surgeon's overall control loops by
18.38%. VascularPilot3D is promising for general clinical autonomous
endovascular navigation.
♻ ☆ RAnGE: Reachability Analysis for Guaranteed Ergodicity
This paper investigates performance guarantees on coverage-based ergodic
exploration methods in environments containing disturbances. Ergodic
exploration methods generate trajectories for autonomous robots such that time
spent in each area of the exploration space is proportional to the utility of
exploring in the area. We find that it is possible to use techniques from
reachability analysis to solve for optimal controllers that guarantee ergodic
coverage and are robust against disturbances. We formulate ergodic search as a
differential game between the controller optimizing for ergodicity and an
external disturbance, and we derive the reachability equations for ergodic
search using an extended-state Bolza-form transform of the ergodic problem.
Contributions include the computation of a continuous value function for the
ergodic exploration problem and the derivation of a controller that provides
guarantees for coverage under disturbances. Our approach leverages
neural-network-based methods to solve the reachability equations; we also
construct a robust model-predictive controller for comparison. Simulated and
experimental results demonstrate the efficacy of our approach for generating
robust ergodic trajectories for search and exploration on a 1D system with an
external disturbance force.
comment: 21 pages, 5 figures
♻ ☆ GCS*: Forward Heuristic Search on Implicit Graphs of Convex Sets
We consider large-scale, implicit-search-based solutions to Shortest Path
Problems on Graphs of Convex Sets (GCS). We propose GCS*, a forward heuristic
search algorithm that generalizes A* search to the GCS setting, where a
continuous-valued decision is made at each graph vertex, and constraints across
graph edges couple these decisions, influencing costs and feasibility. Such
mixed discrete-continuous planning is needed in many domains, including motion
planning around obstacles and planning through contact. This setting provides a
unique challenge for best-first search algorithms: the cost and feasibility of
a path depend on continuous-valued points chosen along the entire path. We show
that by pruning paths that are cost-dominated over their entire terminal
vertex, GCS* can search efficiently while still guaranteeing cost-optimality
and completeness. To find satisficing solutions quickly, we also present a
complete but suboptimal variation, pruning instead reachability-dominated
paths. We implement these checks using polyhedral-containment or sampling-based
methods. The former implementation is complete and cost-optimal, while the
latter is probabilistically complete and asymptotically cost-optimal and
performs effectively even with minimal samples in practice. We demonstrate GCS*
on planar pushing tasks where the combinatorial explosion of contact modes
renders prior methods intractable and show it performs favorably compared to
the state-of-the-art. Project website: https://shaoyuan.cc/research/gcs-star/
comment: Accepted to WAFR 2024. Conference Ready Version